Appropriate Microcomputer Item Analysis
for Domain-Referenced Classroom Testing

Anthony J. Nitko and Tse-chi Hsu
School of Education
University of Pittsburgh

Abstract: This paper describes item analysis procedures appropriate for
domain-referenced classroom testing and how these procedures can be
implemented with a microcomputer program. First, it presents a conceptual
framework within which teachers' informational needs and item
statistics can be considered. Second, it reviews approximately fifty
item statistics and uses logical analysis and Monte Carlo sampling
studies to recommend several statistics to be incorporated
into a microcomputer program for classroom teachers.

Item Analysis Appropriate for Domain-Referenced
Classroom Testing*
by

Anthony J. Nitko and Tse-chi Hsu
School of Education
University of Pittsburgh

The purpose of this paper is to describe the kinds of item analysis 
information useful for domain-referenced classroom testing. The paper is 
organized in the following way. First we present a conceptual framework 
within which item statistics can be considered. Second, we review promising 
statistics in light of this framework. Third, we examine the sampling
fluctuations of several of the more promising item statistics for sample 
sizes comparable to what we would expect the typical classroom size to be. 
Fourth, we recommend several statistical indices that are the most promising 
ones to use in an item analysis package programmed for an Apple II Plus 
microcomputer. 

The reader of this report should keep in mind several points. First,
the item analysis procedures and statistics recommended in this report are
constrained by the practical limits of schools, of teachers' experience and
time, and of the capacities of a particular microcomputer. Second, the
primary functions of an analysis of pupils' responses to test items are to
assist a teacher in (a) making instructionally relevant decisions and (b)
improving the technical quality of the test items used. In this item analysis
process, the teacher is encouraged to use the computer as a tool, and no
attempt is made to use item statistics to create a computer-assisted
"teacher-proof" system of item analysis. Third, it is recognized that
selecting appropriate item statistics means not simply focusing on the
quality and usefulness of the statistics qua statistics, but also considering
the appropriateness of the statistics in terms of the understanding and
interpretation that teachers are able to give them. Fourth, the appropriate
number and presentation of statistics are important to their use by teachers.
If too many statistical indices are provided simultaneously and in an
"unfriendly" format, a teacher will be confused. Thus, although we recommend
quite a few statistical indices in this report, we do not recommend that these
statistics be reported simultaneously in uninterpreted form. Designing an item
analysis microcomputer program is in part a human engineering problem. Fifth,
a microcomputer program that computes the recommended statistics should
present the information to the teacher in a way that will facilitate
interpreting the teacher's particular classroom data. Sometimes this means
simply displaying the numerical value of a statistical index. At other times
it will mean programming decision rules into the computer that will recommend
certain teacher actions or certain teacher options. Sixth, it should be noted
that all of the statistical indices we recommend in the last section should be
available to a teacher upon request, even if they are not displayed initially.
Thus programming techniques should be used that will permit a teacher to dip
deeper into the data and to obtain the actual numerical values of the indices,
if desired.

*We are deeply indebted to Dr. Huynh Huynh for his valuable assistance in
providing us with the Monte Carlo sampling data presented in the third
section and for his immense contribution in clarifying our thinking about
the item statistics reviewed in this report. We are solely responsible, of
course, for all errors and misconceptions that remain herein.

Framework for Considering Item
Analysis and Item Statistics

There are important differences between using tests to measure pupils and
using tests to improve the pupils' instruction. Whereas measurement seeks not
to alter the characteristic being tested, instruction explicitly seeks to change
the pupil so that eventually every test item in the domain can be answered
correctly (cf. Lord, 1970). In order for tests to be effective as classroom
instructional tools, however, it is necessary to integrate them into the
instructional decision-making process. This means that teachers have to
design tests for the decisions for which they will be using them. This
assures that the test information has a reasonable chance of being useful.

The term domain-referenced test is broadly defined to mean a test that 
is built so that scores on it can be referenced to a well-defined class or
domain of behaviors in a way that permits an examinee's status on that domain 
to be estimated. This is a broad definition of domain-referencing and there 
is little difference between it and criterion-referencing as this latter term 
has been recently explicated (Nitko, 1980). Both concepts essentially mean 
the same thing, requiring a well-defined class of tasks or behaviors to which 
test performance can be referenced. Most persons prefer the term criterion- 
referencing (Popham, 1978; Hambleton, Swaminathan, Algina, & Coulson, 1978). 

Classifications of domain-referenced tests such as that presented by
Nitko (1980) are likely to be unfamiliar to teachers. However, teachers can
be encouraged to view their own tests in this broader context. Most teachers'
tests are of the unordered variety, being built on the basis of verbal
statements of stimuli and responses (i.e., behavioral objectives) and sometimes
on the basis of diagnostic categories of pupil difficulties. But at least for
some classroom decisions, such as grouping students or distinguishing among
degrees of mastery of a topic, ordered domains may be more appropriate. This
means, for example, that a teacher's interpretation of item analysis and other
test statistics will depend on the type of domain-referenced test being built,
as well as on the type of decision for which the test information will be used.

The workable level of specificity for domain definition is likely to be
the behavioral objective. Teachers can use behavioral objectives to organize
and direct their instruction. Currently many training programs teach teachers
how to write objectives and use them for instructional design. Further, many
school districts define their curriculum using objectives. Thus, for most
teachers, domain-referenced classroom testing is likely to center around
behavioral objectives at this point in time.

The responses of students to the items on a classroom test provide a teacher
with information in three broad and interrelated areas: (1) improving and
guiding instruction, (2) editing and improving individual test items, and (3)
improving the properties of the total test score for certain decision-making
purposes. First, pupils' responses to stimulus material a teacher presents for
purposes of evaluation provide clues concerning what pupils have learned, the
extent to which the material has been learned, and the nature of pupils' errors
and misunderstandings. Item analysis can provide a teacher with valuable
summary information about the class of pupils, as well as identify pupils who
respond in unusual ways to the stimulus material. Such instructionally
relevant information, when brought to the attention of a teacher, can provide
the basis for instructional planning.

Second, pupils' responses to test items provide valuable information about
how the individual test items are functioning. Test items should be designed
to elicit certain important pupil responses that a teacher can use to decide
whether learning has occurred. Viewed in this way, a test item and its parts
have very specific functions. Data about the test item and its parts can be
analyzed and used to decide whether these functions are being fulfilled. As
an example, consider the alternatives of a multiple-choice item. Data can be
gathered to provide a teacher with information about such matters as whether
less knowledgeable students are attracted to incorrect responses and whether
two or more alternatives appear to be ambiguous to the more knowledgeable
pupils. Additional information about how an item has functioned and what
might be done to improve it can be provided, of course.

A third area in which item statistics can be helpful is in suggesting
ways for improving the entire collection or ensemble of test items that
comprise a particular test. Each item contributes to the score on the total
test in well known ways. Thus, the entire test is dependent on the properties
of the individual items. What are considered to be the desirable properties of
the total test, on the other hand, depends on the particular purposes or
decisions for which that test score will be used. A test may be used, for
example, to estimate a domain score without reference to the performance of
other pupils in the class. Or, the test score may form the basis to rank or
order pupils for purposes of assigning letter grades or for forming subgroupings
of pupils for instructional purposes. Tests with such diverse purposes will
have different properties. The items comprising the tests will need to exhibit
different properties as well. Thus, statistics computed in an item analysis
microcomputer program will need to fit into the purposes for which the teacher
will use the total test scores.

The three broad areas and the specific kinds of information needed under
each area are listed in Table 1. The specific information is discussed in
the sections which follow. As can be seen from a perusal of the table, the
three areas are interrelated and specific information in one area may often
be used for a related purpose in another area. In the discussion which follows,
each kind of specific information is discussed separately. However, the
reader should keep in mind their interrelationships.

Insert Table 1 here

Item Analysis Information Useful for Guiding Instruction

Unless otherwise noted, the descriptions in this section refer to
information that is provided for each test item, rather than for the total test
or for clusters of test items.

1. Summary of how the class performed. Each test item is intended to
measure knowledge or application of an important fact, concept, or principle
that a teacher has taught. Often, such knowledge and/or application is
prerequisite to the next unit or step in an instructional sequence. It is
useful, therefore, for a teacher to know the extent to which the class as a
whole has acquired this knowledge or skill since future instructional planning
can be informed by such information.
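
As a minimal sketch of such a class summary, the proportion of pupils
answering each item correctly could be computed as below. The data layout (one
list of 0/1 item scores per pupil) and the function name are assumptions made
for illustration only.

```python
def class_item_summary(responses):
    """Return the proportion of the class answering each item correctly.

    responses: list of per-pupil lists of 0/1 item scores
               (a hypothetical layout chosen for this sketch).
    """
    n_pupils = len(responses)
    n_items = len(responses[0])
    # For each item, count the correct answers and divide by class size.
    return [sum(row[i] for row in responses) / n_pupils
            for i in range(n_items)]


if __name__ == "__main__":
    scores = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
    print(class_item_summary(scores))
```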

Table 1. Three broad areas and the specific information needed about the
items comprising a domain-referenced classroom test.

Areas in which information will be needed for each item:

Improving and guiding instruction
  1. Summary of how the class performed on an item.
  2. Discrepancy between the performance a teacher expected of the
     class and the actual class performance.
  3. Unusual performance of a student on an item.
  4. Hierarchical ordering among the items in a test.
  5. Change in a class' performance after instruction.
  6. Summary of the seriousness of pupils' errors on an item.
  7. Summary of the types of errors pupils committed on an item.

Rewriting individual test items
  1. Extent of item-objective congruence.
  2. Congruence of test items to instructional events in a classroom.
  3. Vocabulary appropriate to the level of the students.
  4. Item difficulty level.
  5. Item discrimination level.
  6. Identification of poor distractors.
  7. Identification of ambiguous alternatives.
  8. Identification of miskeyed items.
  9. Identification of patterns of guessing among knowledgeable
     students.

Selecting items to put on a test
  1. Item discrimination level.
  2. Item difficulty level.
  3. Relation of an item to test blueprint and/or domain specification.
  4. Estimated total test properties.

2. Discrepancy between a teacher's expectation of a class' performance
and the actual performance of the class. An important kind of information for
a teacher is whether the class performed as the teacher expected. To conduct
instruction effectively a teacher needs a good sense of whether students are
behaving in expected ways. Teachers often do have informal, implicit
expectations about the number or percent of students who would be expected to
have learned certain concepts at particular points in time. An item analysis
program can and should compare a teacher's expectations with actual student
performance and alert the teacher to confirmations or discrepancies.
Knowledge and/or skill areas in which students perform significantly worse than
a teacher expects can serve as the basis for planning remedial instruction,
while areas in which students perform better than expected can reinforce a
teacher's self-concept in relation to teaching skill and serve to raise a
teacher's expectations for students in subsequent learning.
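
The comparison of expected and actual class performance described in item 2
could be sketched as follows. The tolerance of 0.15 is a hypothetical
rule of thumb, not a value taken from this report, and the dictionary layout
and labels are likewise assumptions.

```python
def expectation_discrepancies(expected, observed, tolerance=0.15):
    """Compare expected and observed proportion correct for each item.

    expected, observed: dicts mapping item id -> proportion in [0, 1].
    tolerance: hypothetical cutoff; smaller differences count as confirmed.
    Returns a dict mapping item id -> (difference, label).
    """
    report = {}
    for item, exp in expected.items():
        diff = observed[item] - exp
        if diff < -tolerance:
            label = "below expectation"   # candidate for remedial instruction
        elif diff > tolerance:
            label = "above expectation"
        else:
            label = "as expected"
        report[item] = (round(diff, 3), label)
    return report


if __name__ == "__main__":
    teacher = {"q1": 0.8, "q2": 0.5, "q3": 0.6}
    actual = {"q1": 0.5, "q2": 0.9, "q3": 0.65}
    for item, result in expectation_discrepancies(teacher, actual).items():
        print(item, result)
```

A program could display only the labels at first and show the numerical
differences when the teacher asks for them.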

3. Unusual student performance in relation to a particular item.
As students interact with test materials they can be expected to behave in
a rather consistent manner. Students with less knowledge and skill can be
expected to score poorly on difficult test items, but to score better on
relatively easy items. Similarly, students with a good command of the
subject can be expected to do well on both easy and relatively more difficult
items. When a student with good ability does poorly on a relatively easy
item, or when a student with poor command of knowledge or skill in an area
gets a rather difficult item correct, these situations should be brought to
the teacher's attention for explanation and consideration for possible action.

In like manner, test items themselves can exhibit unusual patterns.
Identifying these patterns could alert the teacher to items that for some
reason do not "fit in" with the majority of items. These items may be in
need of revision or the items may identify instructional areas that need
attention.
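
One way such unusual student-by-item responses might be flagged is sketched
below. The four cutoffs (what counts as an easy or hard item, and as a
high- or low-scoring student) are hypothetical values chosen for
illustration, as are the function name and data layout.

```python
def flag_unusual_responses(responses, easy=0.75, hard=0.25,
                           high=0.75, low=0.25):
    """Flag responses that run against the expected pattern.

    responses: dict mapping student name -> list of 0/1 item scores.
    easy, hard: hypothetical cutoffs on item proportion correct.
    high, low: hypothetical cutoffs on a student's proportion correct.
    """
    students = list(responses)
    n_items = len(responses[students[0]])
    n = len(students)
    # Proportion of the class answering each item correctly.
    p = [sum(responses[s][i] for s in students) / n for i in range(n_items)]
    flags = []
    for s in students:
        prop = sum(responses[s]) / n_items
        for i, y in enumerate(responses[s]):
            if prop >= high and p[i] >= easy and y == 0:
                flags.append((s, i, "able student missed easy item"))
            elif prop <= low and p[i] <= hard and y == 1:
                flags.append((s, i, "weak student passed hard item"))
    return flags


if __name__ == "__main__":
    data = {"A": [0, 1, 1, 1], "B": [1, 1, 1, 0],
            "C": [1, 1, 0, 0], "D": [1, 0, 0, 0]}
    print(flag_unusual_responses(data))
```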

4. Hierarchical ordering among the items in a particular test. It would
be useful for planning instruction and diagnosis if a teacher knew whether a
hierarchical structure existed among the items in a test. Depending on the
nature of the skills and concepts included on the test, the identification
of a hierarchy among the items could help a teacher plan diagnosis and
remedial instruction by suggesting subconcept arrangements or suggesting
a possible order in which concepts could be taught.
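
A very simple check for hierarchical ordering is sketched below: item i is
treated as a candidate prerequisite of item j when (almost) no pupil passes
j while failing i. This is a Guttman-style heuristic assumed here for
illustration; it is not presented as the index this report recommends.

```python
def prerequisite_pairs(responses, max_violations=0):
    """List (i, j) pairs where item i looks prerequisite to item j.

    responses: list of per-pupil lists of 0/1 item scores.
    max_violations: hypothetical allowance for pupils who pass j but fail i.
    """
    n_items = len(responses[0])
    pairs = []
    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                continue
            passed_j = sum(r[j] for r in responses)
            violations = sum(1 for r in responses if r[j] == 1 and r[i] == 0)
            # Require at least one pupil passing j so the pair is informative.
            if passed_j > 0 and violations <= max_violations:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(prerequisite_pairs(data))
```

With classroom-sized samples, pairs found this way are only suggestions for
the teacher to examine; two items answered identically by every pupil will
appear in both directions.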

5. Changes in a class' performance as a result of instruction.
Occasionally, a teacher may use a pre-instructional test (pretest) and a
post-instructional test (posttest) containing identical (or equivalent)
test items. In such cases, it is important for the teacher to know the
items on which the class' performance changed, and the extent and direction
of that change. An item that tests a particular concept or skill and for
which there is little or no change as a result of instruction, or for which
performance after instruction is worse than before instruction, may indicate
that the instruction was ineffective or unnecessary, the item was poorly
written, or the students had responded indifferently. In any event, if
pretest and posttest data are available, an item analysis program should
analyze them and permit the teacher who so desires to consider their
implications for instruction.
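
The extent and direction of change on each item could be summarized, for
example, as the posttest proportion correct minus the pretest proportion
correct. The sketch below assumes identical items in the same order on both
tests; the function name and data layout are illustrative.

```python
def item_change(pre, post):
    """Return posttest minus pretest proportion correct for each item.

    pre, post: lists of per-pupil lists of 0/1 item scores on the same
    items (the two classes of responses need not be the same size).
    """
    def p_values(matrix):
        n = len(matrix)
        return [sum(row[i] for row in matrix) / n
                for i in range(len(matrix[0]))]

    # Positive values mean improvement; negative values mean decline.
    return [round(after - before, 3)
            for before, after in zip(p_values(pre), p_values(post))]


if __name__ == "__main__":
    pretest = [[0, 1], [0, 1], [1, 0], [0, 1]]
    posttest = [[1, 1], [1, 0], [1, 0], [0, 1]]
    print(item_change(pretest, posttest))
```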

6. Summary of seriousness of pupil errors on an item. Pupils who
answer an item completely incorrectly or receive less than full credit on
an item commit errors of various degrees of seriousness. It would be helpful
to instructional planning if a teacher had, for each test item, a summary of
the seriousness of errors committed. This summary could help, for example,
in setting instructional priorities. Such information could be obtained,
however, only if the teacher could codify or rate the seriousness of errors
of each student.

7. Summary of types of pupil errors on an item. Related to Point 6
above, the type or kind of error committed is useful information as well.
A teacher would need to classify the various types of errors that could be
committed by students (presumably known by a teacher from past experience)
and then in some way identify for each student the type(s) of error committed.
This may be a tedious and, therefore, impractical task for a teacher unless
(a) the number of error types is small or (b) the options on a multiple-choice
test are specifically written to attract students who commit specific
error types. In the former case, it is conceivable that 3 or 4 coarse types
of errors which could be found on any item on the test could be identified.
For example, in social studies, errors could be classified as (a) incorrect
reasoning, (b) incorrect knowledge of a concept or principle, (c) lack of
knowledge of an important fact, and (d) spelling errors. Each student's
response is graded in the normal fashion and, in addition, items with less
than perfect responses are coded according to one or more of these error
categories. (If the above illustrated error categories were arranged in
order of seriousness, then both information Points 6 and 7 could be handled
simultaneously.) In the case of multiple-choice items, a similar
categorization could result, except that more error types could be identified
because the microcomputer could automatically classify pupils' responses into
various error types.
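
Once a teacher has coded each imperfect response, Points 6 and 7 could be
tabulated together along the following lines. The four category names echo
the social studies example above, but the numerical seriousness weights are
purely hypothetical, as is the record layout.

```python
from collections import Counter

# Hypothetical seriousness weights (larger = more serious).
SERIOUSNESS = {"reasoning": 4, "concept": 3, "fact": 2, "spelling": 1}


def summarize_errors(coded):
    """Tabulate error types and mean seriousness for each item.

    coded: list of (student, item, error_code) tuples entered by the teacher.
    Returns a dict mapping item -> (counts by error type, mean seriousness).
    """
    by_item = {}
    for _student, item, code in coded:
        by_item.setdefault(item, Counter())[code] += 1
    summary = {}
    for item, counts in by_item.items():
        total = sum(counts.values())
        mean = sum(SERIOUSNESS[c] * k for c, k in counts.items()) / total
        summary[item] = (dict(counts), mean)
    return summary


if __name__ == "__main__":
    codes = [("Ann", 1, "reasoning"), ("Bob", 1, "spelling"),
             ("Cy", 1, "spelling"), ("Ann", 2, "fact")]
    print(summarize_errors(codes))
```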

Item Analysis Information Useful for Editing and Revising Items

Certain kinds of information can be obtained from pupil responses to 
classroom test items that will suggest possible flaws in the items. Items 
exhibiting patterns of pupil responses suggestive of flaws could be flagged 
and brought to the teacher's attention. Some types of information useful 
for revising items may come from a closer examination of the items by the 
teacher, rather than analysis of pupil responses per se. Both types of 
information are described below. 

1. Extent of item-objective congruence. An essential part of any review
of classroom test items is the extent to which each item corresponds to the
instructional objective it is intended to measure. This information can be
obtained either from the judgment of an individual teacher or from the
pooled judgments of a group of teachers. The former is likely to be the
typical source of information, while the latter is likely to be obtained
when committees of teachers form common tests or when a school district
uses an objective-based mastery learning system. In the latter situation,
individual teachers usually do not construct their own test items. Often,
the items are purchased or developed by outside agencies on a contract basis.

2. Congruence of test items to instructional events in a classroom.
It is important that an item correspond to what a teacher has taught or the
pupils were supposed to study. It often happens when test items are purchased
(or provided as part of an instructional materials package) that they
correspond to written statements of objectives but not to the precise manner
in which students were taught to respond in the classroom. For example,
a publisher-provided test item in history may emphasize a different
interpretation than occurred in the classroom or a science item may illustrate
a principle with a different experiment than a teacher used. In such cases,
teacher revisions can "fine-tune" an item to make it a more valid measure
of a pupil's learning.

3. Vocabulary appropriate to the level of the students. Items written
by persons who lack daily contact with the particular pupils being tested
may contain phrases and vocabulary words that are not appropriate and thereby
interfere with pupils' ability to express the knowledge they have acquired.
For example, in a language development curriculum in a junior high school,
students may learn the definitions of new vocabulary words through class
discussion and writing sentences using their current vocabulary and language
level. A teacher, however, may elect to use items on a mastery test that
were provided by a textbook publisher. Such items may be multiple-choice
and, conceivably, their alternatives could contain vocabulary words that are
beyond the language development level of the students being tested. Thus,
although students may have learned the specific words they were taught,
they could not demonstrate this knowledge to the teacher.

The information about the vocabulary level of the wording in an item 
can be obtained either by judgments from one or more teachers, or by checking 
an item against a specific vocabulary list. 

4. Difficulty level of the item for students. An item that too few
students answer correctly may be flawed in some way and hence should be
revised. But difficulty level alone is not the sole criterion for revision,
since a test item may be well-written but the students may not have learned
the requisite material. Similarly, an item that is too easy may reflect good
pupil learning or a keyed answer that is too obvious to pupils. In either
case (a flawed item or a reflection of the learning status of students), the
difficulty level of the item contains important information.

5. The discrimination level of an item. Items for which the lower scoring
pupils on a test do better than the high scoring pupils need to be examined
for possible flaws, since these items function in a manner that is in
opposition to the bulk of the items in the test. Similarly, items which do not
distinguish the more able from the less able should be examined in at least
a cursory manner to assure that they are properly written. As with all the
information in this section, the purpose is to identify items that may be in
need of revision, rather than to collect information for purposes of culling
and selecting items.
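
One common way to express an item's discrimination level is the difference
between the proportions of higher and lower scoring pupils who answer it
correctly. The sketch below uses that upper-lower difference with the class
split into halves; the split fraction and function name are assumptions,
and other discrimination indices could be substituted.

```python
def discrimination_index(responses, item, frac=0.5):
    """Upper-group minus lower-group proportion correct on one item.

    responses: list of per-pupil lists of 0/1 item scores.
    item: index of the item being examined.
    frac: hypothetical fraction of the class placed in each group.
    """
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    k = max(1, int(len(ranked) * frac))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(r[item] for r in upper) / k
    p_lower = sum(r[item] for r in lower) / k
    # Negative values mean lower scorers outperformed higher scorers.
    return p_upper - p_lower


if __name__ == "__main__":
    data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
    for i in range(4):
        print(i, discrimination_index(data, i))
```

Items with a negative value would be flagged for examination, and items near
zero would receive at least a cursory review.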

The following information can be collected only for true-false, matching, 
and multiple-choice items. 

6. Identification of poor distractors. The distractors of a multiple-choice
item function as plausible choices for the students who have not
acquired the knowledge required to answer the item correctly. Empirical data
from pupils can identify items that are not functioning in this way.
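
A simplified sketch of a distractor check follows. It flags any distractor
chosen by less than a small share of the class; the 5 percent cutoff, the
option labels, and the use of the whole class rather than only the less
knowledgeable pupils are all simplifying assumptions.

```python
from collections import Counter


def weak_distractors(choices, key, options=("A", "B", "C", "D"),
                     min_share=0.05):
    """Return distractors chosen by fewer than min_share of the pupils.

    choices: list of the option each pupil selected on one item.
    key: the keyed (correct) option.
    min_share: hypothetical cutoff below which a distractor is flagged.
    """
    counts = Counter(choices)
    n = len(choices)
    return [opt for opt in options
            if opt != key and counts[opt] / n < min_share]


if __name__ == "__main__":
    picked = list("AAAAAABBBC")
    print(weak_distractors(picked, key="A"))
```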

7. Identification of ambiguous alternatives. In this context, two
alternatives are ambiguous if students who know the material an item is
supposed to test tend to have difficulty deciding which of the two alternatives
is the correct answer.

8. Identification of miskeyed items. Occasionally a teacher inadvertently
miskeys a multiple-choice item. Data from pupil responses are examined in
relation to the teacher-keyed answer. If the more knowledgeable students
choose an incorrect alternative in large numbers, the item may have been
miskeyed. Flagging such items brings them to the attention of the teacher.
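
The miskey check could be sketched as below. Here "in large numbers" is
interpreted, as an assumption, to mean that the most popular choice among
the top-scoring half of the class differs from the keyed answer; the record
layout and group fraction are also hypothetical.

```python
from collections import Counter


def possible_miskey(records, key, top_frac=0.5):
    """Return the option favored by top scorers if it is not the key.

    records: list of (total_score, chosen_option) pairs for one item.
    key: the teacher-keyed option.
    top_frac: hypothetical fraction of pupils treated as more knowledgeable.
    Returns the suspect option, or None when the key is the modal choice.
    """
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    top_choices = Counter(choice for _score, choice in ranked[:k])
    modal, _count = top_choices.most_common(1)[0]
    return modal if modal != key else None


if __name__ == "__main__":
    item = [(10, "B"), (9, "B"), (8, "B"), (7, "A"), (3, "A"), (2, "C")]
    print(possible_miskey(item, key="A"))
```

Because total scores are themselves computed with the suspect key, a flag
here is only a prompt for the teacher to recheck the answer key.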

9. Identification of items for which random guessing may be occurring
among the more knowledgeable students. The more knowledgeable students are
expected to have acquired the information or skill on which an item is based.
Studying the response patterns of this group of students may reveal that they
are not responding in the expected manner. If so, such items should be flagged
and reviewed by the teacher.

Improving Properties of the Total Test* 

The properties of the total test are a function of the properties of the
items comprising the test. Therefore, it is important that a teacher attend
to certain item properties when assembling a test. The properties of the
items to which a teacher should attend depend to a considerable extent on how
the total test score will be used, that is, on the decisions for which the
teacher will use the scores.

In general, classroom tests tend to be used for decisions that require
one of the following: (a) complete ordering of students, (b) partial ordering
of students, or (c) ascertaining the domain status of students. Ranking
students on a test and grading on the curve are examples of test usages
requiring complete ordering of students. Some uses to which test scores
are put require partial ordering, as when a teacher seeks to place students
into two groups (for example, better readers and less able readers) with
the intention of treating individuals within each group in approximately
the same way. (All students in the better readers group, for example, may
be permitted to proceed with new material, while students in the lower group
are given the same remedial instruction.)

*We have limited this technical report to a discussion of the item statistics
only, although we recognize that a complete item analysis computer package
should compute total test score statistics (e.g., mean, standard deviation,
median, and various reliability indices), compute percentile ranks (perhaps
standard scores), and tabulate a frequency distribution.

A teacher seeks an estimate of a pupil's domain status when a decision
depends on a person's domain score without regard to the domain scores of
other pupils. Estimates of domain status are usually expressed in terms of
a percent or fraction of the domain a student knows. Estimates of domain
status are of concern when instructional decisions depend on absolute
achievement rather than relative achievement. A decision about an individual
student's mastery of an instructional objective, for example, is often based
on an estimate of that student's domain status. A student is declared to have
achieved sufficient mastery if the student scores high enough on a test
measuring that objective.

Keeping in mind the distinctions between absolute and relative
achievement, and between partial and complete ordering of students, the
following types of information about individual test items seem important for
classroom test development. The reader should note that, as with other types of
information in item analysis, the interpretation of statistical indices of
this information will require programming into the microcomputer certain
rules of thumb to assist in the decision-making process.

1. The extent to which test items discriminate among students.
Regardless of whether a teacher uses a test to measure relative or absolute
achievement, the items on the test should contribute information to the
total test score in the same algebraic direction. That is, as a group,
the higher scoring pupils on the test should have a rather high probability
of answering correctly each test item. (This is not to say that each pupil
in the higher scoring group will definitely answer correctly every test
item, only that there is a propensity to do so.) When a larger proportion
of higher scoring students than lower scoring students answer an item
incorrectly, a teacher's interpretation of the total test score becomes
confused. These negatively discriminating items tell a teacher that the more
a student knew (as reflected by the test score) the less are the chances of
answering the items correctly. Negatively discriminating items should be
examined by a teacher and either revised or not put on the same test with
the positively discriminating items.

A decision about which of the positively discriminating items to place on
a test depends on (a) the type of achievement being measured (absolute or
relative), (b) the nature of the test specifications, (c) the type of decision
to be made, (d) other properties of the items, such as their difficulty
levels, and (e) the type of statistical index used to summarize
discrimination. These factors are considered in subsequent sections.

2. Difficulty of the item for the class. As we have described
previously, item difficulty plays a role in both improving the effectiveness
of instruction and in revising a test item. Item difficulty also plays a
role in assembling a test since the difficulty levels of the individual
items comprising a test set the difficulty level of the total test. Item
difficulty level also sets limits on item discrimination and on the total
test reliability. When a test score will be used for partial or complete
ordering, item difficulty plays a role in helping to establish the ability
level at which a test is most reliable.

3. Relation of an item to the test blueprint and/or the domain. This
information is a judgment of the item-domain congruence and/or an indication
of where an item fits in the test blueprint (plan). The item-domain
congruence judgments have been described earlier in this report. The second
type of information is important to the assembly of a classroom test in that
it assures the items on the test have sufficient content scope and behavioral
breadth for the total test to be content valid.

4. Projection of statistical properties of the total test from the
properties of the individual items. If a teacher is assembling a test by
selecting items from a pool of previously used items that have known
statistical properties, it would be helpful for the teacher to know what to
expect in the way the total test will perform. At the minimum, it would be
helpful to obtain an estimate of the mean of the test. Other useful
information includes estimates of the test reliability and standard deviation.
This total test information can be estimated from the statistics available on
each item. If the items to be used come from an item bank that has been
calibrated using a latent trait model, then other total test properties can be
described, such as the part of the ability continuum on which the assembled
test provides the most information.
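
For dichotomously scored items, standard classical test theory identities
allow such a projection from item statistics alone: the test mean is the sum
of the item proportions correct, and the test standard deviation is the sum
of each item's standard deviation weighted by its item-total correlation.
The sketch below applies these identities and then the KR-20 formula. It
assumes that item-total correlations from earlier administrations carry
over to the newly assembled test, which is only an approximation, and it is
not presented as the procedure this report adopts.

```python
import math


def project_test(p, r):
    """Project mean, SD, and KR-20 of a test from item statistics.

    p: list of item proportions correct (0/1 scored items).
    r: list of item-total correlations, assumed to carry over from
       earlier use of the items (an approximation).
    """
    k = len(p)
    mean = sum(p)                                    # sum of item means
    item_sd = [math.sqrt(pi * (1 - pi)) for pi in p]
    sd = sum(ri * si for ri, si in zip(r, item_sd))  # sum of r_i * s_i
    sum_item_var = sum(s * s for s in item_sd)
    kr20 = (k / (k - 1)) * (1 - sum_item_var / sd ** 2)
    return mean, sd, kr20


if __name__ == "__main__":
    mean, sd, kr20 = project_test([0.5, 0.5, 0.5, 0.5], [0.75] * 4)
    print(mean, sd, round(kr20, 3))
```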
Review of Statistical Indices Having Potential
for Providing the Information Needed
for Domain-Referenced Classroom Tests

Having set out in the previous section the information requirements of
domain-referenced classroom item analysis, we turn our attention to specific
statistical indices which could provide these kinds of information. In
this section, we will return to each area previously described, but limit
the discussion primarily to various statistical indices.

Improving and Guiding Instruction

In this section we review a number of item statistics that have potential
for providing the classroom teacher with the specific kinds of information
listed in Table 1 for improving and guiding instruction. By and large, the
statistics we review here are considered without regard to their sampling
errors. Sampling errors are important to consider in selecting statistics
when inferences are made about estimating population parameters or when one
seeks to understand the stability of a numerical result when a replication
is important such as in an experiment or survey. The numerical values of the
indices discussed in this section, when used by the teacher, will be based on
a specific set of students and, therefore, when recomputed on data from a new
group of students, will likely yield a different numerical value. However, a
teacher is interested primarily in working with the group of students at hand
at any particular time. Thus, sampling fluctuations are less of a concern when
the statistical information is to be used to change the students in the sample
in some way. (Sampling fluctuations are more of a concern, however, when
using item statistics to revise or select items for a test.)

Table 2 provides a list and a brief description of several statistical
indices having some potential for providing item data that will serve the
various information needs of teachers and which may possibly be computed on a
microcomputer. Below we will describe each of these statistics in more
detail, pointing to the advantages and disadvantages of providing them as
part of a microcomputer item analysis program for classroom teachers.



Insert Table 2 here 



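1. Statistical summary of the class' performance on a test item. Table 2
lists six statistical indices which are defined as follows:

$p_i = \frac{1}{N} \sum_{a=1}^{N} Y_{ai}, \quad Y_{ai} = 0, 1$   [1]

$\bar{Y}_i = \frac{1}{N} \sum_{a=1}^{N} Y_{ai}, \quad m_i \le Y_{ai} \le l_i$   [2]

$P_i = \frac{\bar{Y}_i - m_i}{l_i - m_i}$   [3]

$V_i = \frac{1}{N} \sum_{a=1}^{N} (Y_{ai} - \bar{Y}_i)^2$   [4]

$P(x_j)_i = \bar{Y}_{1i}, \bar{Y}_{2i}, \ldots, \bar{Y}_{ji}, \quad m_i \le Y_{ai} \le l_i$   [5a]

$P(x_j)_i = p_{1i}, p_{2i}, \ldots, p_{ji}, \quad Y_{ai} = 0, 1$   [5b]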
Table 2. Statistical item data potentially useful for helping a teacher
improve and guide instruction.

Each numbered entry gives a type of information a teacher could use,
followed by the possible statistical indices.

1. Summary of the performance of the class on each item.

   $p_i$
      The fraction or percent of the entire class passing a dichotomously
      scored item.
   $\bar{Y}_i$
      The mean item score of the entire class for an item that is scored
      in a graded or continuous way.
   $P_i$
      The mean item score, $\bar{Y}_i$, expressed as a percent of the
      maximum possible item score.
   $V_i$
      A measure of the variability of the item scores of the entire class
      for an item scored in a graded or continuous way.
   $P(x_j)_i$
      A function that displays average item score for each of j levels of
      the total test score.
   $P(\theta)_i$
      The item characteristic curve for a dichotomously scored item.

2. Discrepancy between the performance a teacher expected of the class and
   the actual performance of the class.

   $D1_i = p_i - Ep_i$
      The difference between the percent of the entire class actually
      passing a dichotomously scored item and the percent of the class the
      teacher expected to pass the item.
   $D2_i = \bar{Y}_i - E\bar{Y}_i$
      The difference between the actual mean item score of the entire
      class and the mean item score the teacher expected the class to
      obtain.
   $D3_i = P_i - EP_i$
      Similar to the above difference except $P_i$ is the mean item score
      expressed as a percent of the maximum possible item score.
   $D4_i = p_i - \phi_i$
      The difference between the percent of the entire class actually
      passing a dichotomous item and the estimated percent passing the
      same item in a suitable norm group (e.g., percent passing in the
      district or percent passing in the state).
   $D5_i = P(x_j)_i - EP(x_j)_i$
      Difference between the actual average item score and expected
      average item score for each of j levels of the total test score.
   $D6_i = A_i / M_i$
      Ratio of actual discrepancy to maximum discrepancy between students'
      choices among options of a multiple-choice item (Huynh, 1983).

3. Unusual pattern of responses on a test for a student.

   Modified caution index of Harnisch and Linn (1981).
   $r_{a,\text{perbis}}$
      Personal biserial correlation (Donlon & Fischer, 1968). The biserial
      correlation between a person's item responses and the difficulties
      of the corresponding items, assuming a normal distribution
      underlying the item responses.
   $r_{a,\text{perptbis}}$
      Personal point-biserial correlation (Brennan, 1980, cited in
      Harnisch & Linn, 1981). The product moment correlation between the
      item scores for a person and the corresponding item difficulties.
   $NCI_a$
      Norm conformity index (Tatsuoka & Tatsuoka, 1982), a measure of the
      degree of consistency between an individual's response pattern and
      the ordering of the items in a norm group.
   $v_a$
      Person fit statistic for Rasch model (Wright and Stone, 1979).

4. Hierarchical ordering of the items on a test.

   IRSA matrix
      Item relation structure analysis matrix (Tatsuoka & Tatsuoka, 1981)
      is used as a basis for ordering the test items in a hierarchical
      directed graph.

5. Change in a class' performance after instruction.

   $D_{i,\text{postpre}} = p_{i,\text{post}} - p_{i,\text{pre}}$
      Difference between the percent of pupils answering the item
      correctly before and after instruction (Cox & Vargas, 1966).
   $D_{i,\text{ingain}} = p_{01}$
      Percent of students who answered the item incorrectly on the pretest
      and correctly on the posttest (Roudabush, 1973).

6. Summary of the seriousness of pupils' errors on an item.

   Proportion of students committing each seriousness level of error.
   Mean rating of the seriousness of errors for the entire group of
   students on a particular item.

7. Summary of the types of errors committed on an item.

   Proportion of students committing each type of error.
$P(\theta)_i = \frac{e^{D(\theta - b_i)}}{1 + e^{D(\theta - b_i)}}, \quad Y_{ai} = 0, 1$   [6]

$P(\theta)_i = \frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}, \quad Y_{ai} = 0, 1$   [7]

$P(\theta)_i = c_i + (1 - c_i) \frac{e^{D a_i (\theta - b_i)}}{1 + e^{D a_i (\theta - b_i)}}, \quad Y_{ai} = 0, 1$   [8]

Where in the above formulas:

   $N$ = number of students taking the test
   $Y_{ai}$ = the score of the ath student on the ith item on the test
   $m_i$ = the lowest possible score a teacher could assign on the ith item
   $l_i$ = the highest possible score a teacher could assign on the ith item
   $\bar{Y}_{ji}$ = the mean score of the jth subgroup of the class of
      students on the ith item (e.g., the lower third)
   $p_{ji}$ = the percent of the jth subgroup of the class of students
      answering the ith item correctly
   $a_i$, $b_i$, $c_i$, $e$ = the parameters and constants of the family of
      latent trait models based on a logistic ogive (see, e.g., Lord, 1980)
We note immediately that if $Y_{ai}$ is a score on an item graded zero or
one, then $p_i = \bar{Y}_i = P_i$. When items are scored in a more continuous
fashion, $p_i$ is not used and $\bar{Y}_i \ne P_i$.

An advantage of using statistics [1], [2], or [3] is that they provide
a single summary number that can capture the performance of a class of
students on a particular test item. A disadvantage, of course, is that these
statistics do not provide a summary of how different types or groups of
students performed: for example, how the lower third of the class performed
compared to how the upper third performed. Thus, some information that is
possible to obtain from the item is lost.

An advantage of [3] is that it expresses the average performance of the
class on a scale whose range is 0.00 to 1.00. This index, which is described
in Whitney and Sabers (1970), is interpreted as the percent of the distance
from the lowest to the highest possible score that the class' average item
score represents. Thus, values of $P_i$ near 1.00 mean that, on the average,
students knew most of the material required by the item, whereas values of
$P_i$ near 0.00 mean that generally students did poorly on the item. This
interpretation is consistent with the interpretation given to $p_i$ when it is
used with dichotomously scored items. A disadvantage of [3] is that a teacher
may lose a sense of the absolute level of the scores. For example, a
$P_i = .80$ may mean $\bar{Y}_i$ = 4.2, 4.0, 3.4, or 3.2 depending on whether
$(l_i, m_i)$ = (5, 1), (5, 0), (4, 1), or (4, 0), respectively. This confusion
can be lessened, perhaps, by making sure that $\bar{Y}_i$ is available to the
teacher upon request. The relationship between $P_i$ and $\bar{Y}_i$ is as
follows:

$\bar{Y}_i = P_i (l_i - m_i) + m_i$   [3a]
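
The summary indices discussed so far are simple enough to compute directly
from a list of item scores. The following is a minimal sketch, not the
program described in this report: the function names are hypothetical, and
it assumes the scores for one item are held in a plain Python list with the
lowest and highest assignable scores supplied by the teacher.

```python
def item_summary(scores, lowest, highest):
    """Summarize one item's scores for a class.

    scores  : item scores Y_ai, one per student
    lowest  : m_i, the lowest score a teacher could assign
    highest : l_i, the highest score a teacher could assign
    """
    n = len(scores)
    mean = sum(scores) / n                               # [1] or [2]
    relative = (mean - lowest) / (highest - lowest)      # [3]
    variance = sum((y - mean) ** 2 for y in scores) / n  # [4]
    return {"mean": mean, "P": relative, "V": variance}


def mean_from_relative(relative, lowest, highest):
    """Recover the mean item score from P_i, as in [3a]."""
    return relative * (highest - lowest) + lowest


# A dichotomous item: the mean and P coincide with the proportion passing.
dichotomous = item_summary([1, 0, 1, 1], lowest=0, highest=1)

# The worked example in the text: P = .80 on a 1-to-5 scale.
absolute_mean = mean_from_relative(0.80, lowest=1, highest=5)
```

Reporting both the relative index and the value returned by
`mean_from_relative` is one way to keep the absolute level of the scores in
view.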
The indices $V_i$, $P(x_j)_i$, and $P(\theta)_i$ all describe in one way or
another how the members of a group differ from each other. The item variance,
$V_i$, has the advantage of being a single number that measures the spread of
individuals. It has serious interpretive problems, however, from the
teacher's viewpoint. First, to understand $V_i$ a teacher would have to have
a sense of the concept of a variance, a concept most teachers do not have.
Second, $V_i$ is not expressed on the same scale as the item scores, so the
square root of $V_i$ would need to be taken. Third, one usually cannot compare
the variance or standard deviation of one test item to that of another test
item because of scaling differences.

The functions [5a, b] and [6], [7], and [8] provide the maximum
information about how students perform on an item, but do so in noncomparable
ways. The latent trait models represented by $P(\theta)_i$ describe the
probability of each student answering the ith item correctly and thus these
models provide a profile of the item performance over the full range of
ability. But these latent trait item characteristic functions have serious
drawbacks when used to describe the performance of a particular group of
students. First, they express performance as a function of an arbitrary
ability score, $\theta$, a concept with which teachers are unfamiliar. Second,
they cannot, and probably should not, be calculated on small samples of data
such as are available to the teacher for the class at hand. Third, if a
teacher uses items from an item bank (or other source) which are already
calibrated using one of the latent trait models, the display of an item
characteristic curve can be easily misinterpreted. The ability distribution
(on the $\theta$-scale) of the particular students in the class is unknown
and, hence, the teacher has no way of knowing to which parts of the ability
scale to refer in order to interpret the item. Finally, teachers' tests are
unlikely to be long enough or homogeneous enough to routinely use latent
trait procedures to estimate students' abilities on the $\theta$-scale.

The function $P(x_j)_i$ serves a purpose in helping the teacher understand
how different levels of students performed on the test. Function
[5a, b] expresses the average item performance in terms of the total test
score. (Sometimes this is called the item-test regression curve (e.g., Lord,
1980).) However, a teacher could use a scale other than the total test score
in this function. It might be reasonable, for example, to use pupils' grades
(A, B, C, etc.) in the subject from the previous marking period.

Since experience indicates that the function [5a, b] will not be regular
for small samples when it is based on each possible value of the total
score, it is likely that a useful form of [5a, b] is to group the students
in some way and then show the average item performance for each of these
subgroups. Upper half versus lower half is likely to be too coarse and
uninformative an interval width. We recommend dividing the class of students
into either thirds (lower, middle, and upper scoring students) or fourths
(using quartiles as the dividing points), if the number of students is
between 25 and 40. Larger classes could be sectioned into fifths (using
quintiles).
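
The grouped form of [5a, b] can be sketched as follows. This is an
illustration under stated assumptions rather than the procedure used in the
program: the function name is hypothetical, students are ranked by total
test score, ties are left in their original order, and the class is cut into
groups of as nearly equal size as possible.

```python
def subgroup_item_means(item_scores, total_scores, groups=3):
    """Average item score for each of `groups` score levels, lowest first.

    item_scores  : scores on one item, one per student
    total_scores : total test scores for the same students, same order
    groups       : 3 for thirds, 4 for fourths, 5 for fifths
    """
    n = len(total_scores)
    if n < groups:
        raise ValueError("need at least one student per group")
    # Rank students by total score; sorted() is stable, so ties keep
    # their original order (an arbitrary choice made for this sketch).
    order = sorted(range(n), key=lambda a: total_scores[a])
    means = []
    for j in range(groups):
        members = order[j * n // groups:(j + 1) * n // groups]
        means.append(sum(item_scores[a] for a in members) / len(members))
    return means


# Six students, one dichotomous item, divided into thirds.
thirds = subgroup_item_means([0, 0, 1, 0, 1, 1], [1, 2, 3, 4, 5, 6])
```

The same function serves for dichotomous and graded items, since the
proportion passing is just the mean of 0/1 scores.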

Summary. Table 3 summarizes our recommendations based on the rational
considerations put forth above. These recommendations are further reviewed
and modified as a result of some empirical (Monte Carlo) studies reported
later in this report.



Insert Table 3 here 



Table 3. Recommended item statistics for helping a teacher improve and guide
instruction: Summarizing the performance of the class on each item.

Column headings:

   Basic: Should be included in every item analysis program, if at all
   possible.
      (i)  Routinely present to or interpret for a teacher on every test
           item.
      (ii) Make available to teachers upon their request only.
   Recommended: Useful to include if (a) research shows teachers can use
   and (b) microcomputer has sufficient memory and speed.
   Not recommended for item analysis programs serving the above mentioned
   purposes.

Type of item scoring: Dichotomous ($Y_{ai} = 0, 1$)

   Basic, routinely presented:      $p_i$; $P(x_j)_i$
   Basic, available upon request:   (none)
   Recommended:                     (none)
   Not recommended:                 $V_i$; $P(\theta)_i$

Type of item scoring: Graded or continuous ($m_i \le Y_{ai} \le l_i$)

   Basic, routinely presented:      $P_i$; $P(x_j)_i$
   Basic, available upon request:   $\bar{Y}_i$
   Recommended:                     (none)
   Not recommended:                 (none)

Note: See the text for definitions, formulas, and explanations.
We should mention here that we are not recommending that $b_i$,
the difficulty parameter of a latent trait model, be reported for purposes
of improving and guiding instruction. The considerations which led us not
to recommend $P(\theta)_i$ have led us not to recommend reporting $b_i$ for
purposes of guiding and improving instruction.

2. Discrepancy between the performance a teacher expected of the class and
the actual performance of the class. The statistics listed in Table 2 are
defined as follows:

$D1_i = p_i - Ep_i, \quad Y_{ai} = 0, 1$   [9]

$D2_i = \bar{Y}_i - E\bar{Y}_i, \quad m_i \le Y_{ai} \le l_i$   [10]

$D3_i = P_i - EP_i, \quad m_i \le Y_{ai} \le l_i$   [11]

$D4_i = p_i - \phi_i$   [12]

$D5_i = P(x_j)_i - EP(x_j)_i$   [13]

$D6_i = A_i / M_i, \quad Y_{ai} = 0, 1$   [14]

In the above formulas $p_i$, $Y_{ai}$, $\bar{Y}_i$, $m_i$, $l_i$, $P_i$, and
$P(x_j)_i$ have been defined previously in Formulas [1]-[5a, b]. We note the
following clarifying definitions:

   $E$ = expectation or "expected value of", but this is not necessarily a
      mathematical expectation (see below).

   $\phi_i$ = an estimate of the proportion passing Item i in a norm group
      (e.g., a school district or children at the same grade level in the
      state's norms)
   $A_i = \sum_{l=1}^{k_i} |t_{li} - s_{li}|$ = actual discrepancy   [14a]

   $k_i$ = the number of options in the ith multiple-choice item

   $t_{li}$ = the number of students the teacher expects to choose Option l
      of Item i

   $s_{li}$ = the actual number of students who chose Option l of Item i

   $s_{hi} = \min_l (s_{li})$

   $M_i = N - s_{hi} + \sum_{l \ne h} s_{li}$ = maximum possible
      discrepancy   [14b]

All of indices [9] through [14] require a method for a teacher to use to
specify how the students in the class are expected to respond to a particular
test item. There are two general ways for a teacher to arrive at this expected
performance for the class at hand: (a) use subjective judgment based on past
experience with these students, and (b) use empirical information and a
statistical estimate. Although not dismissing statistical estimates as
inappropriate for the purposes at hand, we are inclined to favor the
judgmental approach for most instructional purposes, especially for $Ep_i$,
$E\bar{Y}_i$, and $EP_i$ in Equations [9], [10], and [11]. We would like
teachers to become directly involved in "messing around" with data from their
students. We feel it serves important instructional purposes for a teacher to
compare his or her expectations of pupils with the pupils' actual performance
on particular test items measuring the instructional objectives the teacher
operationalizes via those items. If a disparity exists between a teacher's
expectations and the pupils' performance, we believe this will be a powerful
motivator for the teacher to explore further for an explanation.

We could, of course, use various statistical procedures (regression,
Bayesian analysis, etc.) to estimate how a teacher's class will perform. Such
estimates will almost certainly contain errors of estimation due to
scedasticity. Further, these estimates would be created by a microcomputer
program in a "black box" atmosphere about which a teacher is likely to
understand very little. It may well be that statistical estimates are more
efficient, sufficient, and consistent, but their impact on a teacher's
behavior is likely to be less in such black box situations than if the
teacher were more personally involved.

Equations [9], [10], and [11] correspond to Equations [1], [2], and [3],
respectively. We have already indicated the advantages and disadvantages of
$p_i$, $\bar{Y}_i$, and $P_i$, and have indicated our recommendations with
respect to each (see Table 3). To use [9]-[11] in an item analysis program, a
teacher would be asked to specify at the time the test is assembled (before
it is administered) the anticipated class performance on each item.*

*If experience indicates that this is too tedious to do for each item,
various alternatives could be used. For example, the teacher could be asked
to specify a single value of $Ep_i$, $E\bar{Y}_i$, or $EP_i$ that would
represent the items, and the deviation of each item from this single value
could be computed. Another alternative is to write the program so that
$Ep_i$, $E\bar{Y}_i$, or $EP_i$ can be specified for only a few (say the most
important) items, rather than all of the items.

To be consistent with our previous recommendations, we would recommend
using [9] and [11] whenever [1] and [3] are used. We anticipate, however,
that [11] will be difficult for teachers to use because it requires a
two-step process: first estimate $E\bar{Y}_i$ and then estimate the percent of
$(l_i - m_i)$ which $E\bar{Y}_i$ represents. To avoid this complication, we
suggest that in an interactive microcomputer program, the teacher be asked to
specify $m_i$, $l_i$, and $E\bar{Y}_i$; then the computer can compute $EP_i$
via:

$EP_i = \frac{E\bar{Y}_i - m_i}{l_i - m_i}$   [11a]
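
The interactive step suggested above might look like the following sketch.
The function names are hypothetical; the teacher is assumed to supply the
lowest and highest assignable scores and an expected mean item score, and the
program derives the expected relative score and the discrepancies of the form
used in [10] and [11].

```python
def expected_relative(expected_mean, lowest, highest):
    """EP_i computed from the teacher's expected mean item score [11a]."""
    return (expected_mean - lowest) / (highest - lowest)


def discrepancies(scores, lowest, highest, expected_mean):
    """Actual minus expected performance on one graded item.

    Returns D2 (in raw item-score units, as in [10]) and D3 (on the
    0.00 to 1.00 scale, as in [11]).
    """
    actual_mean = sum(scores) / len(scores)
    actual_relative = (actual_mean - lowest) / (highest - lowest)
    return {
        "D2": actual_mean - expected_mean,
        "D3": actual_relative - expected_relative(expected_mean, lowest, highest),
    }


# Five essays scored 1 to 5; the teacher expected a class mean of 4.0.
result = discrepancies([3, 4, 3, 4, 3], lowest=1, highest=5, expected_mean=4.0)
```

Negative values indicate that the class did less well than the teacher
anticipated. For a dichotomous item the same function applies with
`lowest=0` and `highest=1`, in which case both values reduce to the
difference in [9].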

It sometimes occurs that a teacher will use test items that have been
administered to a broader group of students of which the students in the
teacher's class are a subgroup. For example, a school district may have
developed a series of mastery tests and have item analysis data available;
a state may use a state-wide assessment program for which test items and
data are released to the teacher; or a teacher may be using an item bank
that contains items calibrated by one of the latent trait models. In cases
such as these, it would be instructive to the teacher to compare the
performance of the students in the teacher's class to the performance of
similar students in the broader group. Equation [12] specifies this
comparison for items scored dichotomously.

It is unlikely that a teacher would have access to items scored in a 
more continuous way since most large scale testing programs use multiple- 
choice items. An exception to this practice is the situation in which writing 
samples are taken and graded, a more frequent practice among school districts 
in recent years. (It might be noted that in many countries outside of the
United States, essay tests are more frequently used than multiple-choice
tests.) Although we do not treat the case of nondichotomously scored items
here, we note that [12] could be adapted easily to accommodate essay tests.

The quantity $\phi_i$ can be obtained in several ways. Computer printouts and
technical reports obtained through the testing office of a school district
would normally contain the proportion of students in the broader group passing
each test item. These can be entered into the microcomputer. If the items
a teacher uses measure a unidimensional latent trait and have been calibrated
via a latent trait model, then a more refined technique could be used to
obtain $\phi_i$. This is explained below.
A teacher's class will vary from year to year in average ability
(as expressed on the latent trait scale $\theta$). If a teacher compares $p_i$
for his or her class with the corresponding index for the broader group
(district, state, etc.), the comparison may be somewhat misleading in that the
more appropriate reference group would be "students with the same ability as
those in this class" rather than "students in general". In effect, the teacher
would like to hold ability constant and compare this class to those of similar
ability. This can be done via the test characteristic curve and item
characteristic curve of latent trait theory in a way that keeps the resultant
information in a metric the teacher can understand. The procedure is as
follows:

(1) Determine the test characteristic curve for the test.

(2) Compute the mean raw score, $\bar{X}$, on the test for the teacher's
    class.

(3) Use this raw score mean and the test characteristic curve to
    estimate the mean ability level, $\mu_\theta$, of the students in this
    teacher's class.

(4) Use the estimated mean ability level of the class with the item
    characteristic curve, $P(\theta)_i$, to estimate the quantity $\phi_i$
    for this class. This is the proportion correct for item i in the
    norm group for those with ability equal to $\mu_\theta$.

The above procedure is illustrated in Figure 1.
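
The four steps can be sketched numerically as follows. This is a minimal
illustration, assuming the items were calibrated with a logistic model whose
parameters are available as (a, b, c) triples, that every a is positive so
the test characteristic curve is increasing, and that the scaling constant D
is 1.7. The function names, the search interval for ability, and the use of
bisection are choices made for the sketch and are not taken from the report.

```python
import math


def icc(theta, a, b, c=0.0, D=1.7):
    """Logistic item characteristic curve: probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Step 1: test characteristic curve, the expected raw score at theta."""
    return sum(icc(theta, a, b, c) for a, b, c in items)


def ability_for_raw_mean(raw_mean, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Step 3: invert the increasing test characteristic curve by bisection."""
    if not tcc(lo, items) <= raw_mean <= tcc(hi, items):
        raise ValueError("raw mean is outside the range of the curve")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < raw_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def norm_proportion(raw_mean, items, i):
    """Step 4: proportion correct on item i at the class's mean ability."""
    theta = ability_for_raw_mean(raw_mean, items)
    a, b, c = items[i]
    return icc(theta, a, b, c)


# Three hypothetical calibrated items; the class mean raw score (step 2) is 1.5.
bank = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
phi_middle_item = norm_proportion(1.5, bank, 1)
```

The value returned is on the proportion-correct scale, so it can be compared
directly with the class's own proportion passing the item.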



Insert Figure 1 here 



Equation [13] describes the discrepancy between the expected performance
of different levels of students and their actual performance on the ith item.
Figure 1. Illustration of the procedure used to estimate the proportion of
the norm group (with the same ability as the teacher's class) answering
correctly the ith item. Panel A: Test characteristic curve. Panel B: Item
characteristic curve.
This equation corresponds to [5a, b]. As was indicated when $P(x_j)_i$ was
discussed, it seems appropriate to divide the class into thirds or fourths
(thus j = 1, 2, 3 or j = 1, 2, 3, 4) unless the group is very large (N > 50).
To use [13] a teacher would be required to specify the expected average item
score (either $p_{ji}$ or $\bar{Y}_{ji}$) for each of the j levels of
students. If precalibrated latent trait items are used, then $EP(x_j)_i$
could be obtained for a particular norm group using the test characteristic
and item characteristic curves in a manner similar to that described for
Equation [12] and shown in Figure 1. Figure 2 illustrates this procedure for
[13] when the class is divided into thirds.

Insert Figure 2 here 

Huynh (1983) recently suggested another index of item discrepancy which
for multiple-choice items is defined by Equation [14]. This index is a ratio
of the actual discrepancy between students' performance and a teacher's
expectations to the maximum possible discrepancy for a particular set of
teacher's expectations. This index requires the teacher to specify, for each
option l of multiple-choice Item i, the number of students expected to
choose that option. Values of the statistic represented by [14] close to 0.00
represent agreement between the pattern of student responses to a
multiple-choice item and the teacher's expectations; values close to 1.00
represent disagreement.

An advantage of [14] is that it permits teachers to specify a pattern of
responses to multiple-choice items. Thus, teachers would have to consider the
nature of each option in relation to the students at hand. If the options
were based on specific kinds of errors or misconceptions, the teacher would
need to consider the number of students in the class likely to make each error
type. While such fine-grained considerations as the expected number of
students who would commit each type of error would seem to be a powerful means
of improving a teacher's awareness of student performance and be helpful
in guiding instruction, we list several disadvantages: (1) many teachers
do not use multiple-choice items, (2) teacher-made multiple-choice items
may not be based on particular error types, (3) teachers may not have the
patience to carefully consider for each item the number of students likely
to choose each option, and (4) teachers may question the usefulness of such
detailed specifications for every item. These practical, human engineering
considerations lead us not to recommend the computation of [14] for purposes
of item analysis programs designed to improve and guide instruction. However,
[14] does seem to be a useful index for measuring the extent to which pupil
responses deviate from a particular pattern. For example, an adaptation of
[14] may be useful for detecting guessing patterns. (See a subsequent section
of this report.)

Figure 2. Illustration of the procedure used to estimate the expected
proportion of the norm group (at each level of ability as the teacher's
class) answering correctly the ith item. Panel A: Test characteristic curve.
Panel B: Item characteristic curve.

Summary. Table 4 summarizes our recommendations based on the rational
considerations described above. These recommendations are further reviewed
and modified as a result of the Monte Carlo studies reported later in the
report.



3. Unusual performance of a student on a test. The statistics listed in
Table 2 are defined as follows:



Insert Table 4 here 
Table 4. Recommended item statistics for helping a teacher improve and guide
instruction: Identifying discrepancies between the performance a teacher
expected and the actual performance of the class.

Column headings:

   Basic: Should be included in every item analysis program, if at all
   possible.
      (i)  Routinely present to or interpret for a teacher on every test
           item.
      (ii) Make available to teachers upon their request only.
   Recommended: Useful to include if (a) research shows teachers can use
   and (b) microcomputer has sufficient memory and speed.
   Not recommended for item analysis programs serving the above mentioned
   purposes.

Type of item scoring: Dichotomous ($Y_{ai} = 0, 1$)

   Basic, routinely presented:      $D1_i = p_i - Ep_i$;
                                    $D5_i = P(x_j)_i - EP(x_j)_i$
   Basic, available upon request:   (none)
   Recommended:                     $D4_i = p_i - \phi_i$
   Not recommended:                 $D6_i = A_i / M_i$

Type of item scoring: Graded or continuous ($m_i \le Y_{ai} \le l_i$)

   Basic, routinely presented:      $D3_i = P_i - EP_i$;
                                    $D5_i = P(x_j)_i - EP(x_j)_i$
   Basic, available upon request:   $D2_i = \bar{Y}_i - E\bar{Y}_i$
   Recommended:                     (none)
   Not recommended:                 (none)

Note: See the text for definitions, formulas, and explanations.
$r_{a,\text{perbis}} = \frac{\bar{\Delta}_{a\cdot} - \bar{\Delta}_{aR}}{S_{\Delta aR}} \cdot \frac{n_{a\cdot} / n_{aR}}{y}, \quad Y_{ai} = 0, 1$   [16]

$r_{a,\text{perptbis}} = \frac{\sum_{i=1}^{I} (Y_{ai} - \bar{Y}_a)(p_i - \bar{p})}{I \, S_{aY} S_p}, \quad Y_{ai} = 0, 1$   [17]

$NCI_a = 2 S_a / S - 1, \quad Y_{ai} = 0, 1$   [18]

$v_a = \frac{1}{I} \sum_{i=1}^{I} \frac{(Y_{ai} - P(\theta_a)_i)^2}{P(\theta_a)_i (1 - P(\theta_a)_i)}, \quad Y_{ai} = 0, 1$   [19]

Where in the above formulas:

   $i = 1, 2, \ldots, I$ items

   $a = 1, 2, \ldots, N$ examinees

   $n_{a\cdot} = \sum_{i=1}^{I} Y_{ai} = X_a$ = total score (total correct)
      for examinee a

   $n_{\cdot i} = \sum_{a=1}^{N} Y_{ai}$ = total number of students answering
      Item i correctly

   $Y_{ai} = 0, 1$ = the item score for examinee a on Item i

   $\Delta_i = 4 z_i + 13$ = normalized difficulty index

   $z_i$ = inverse normal transformation of $p_i$ ($p_i$ = proportion of the
      class answering Item i correctly)

   $\bar{\Delta}_{a\cdot} = \sum_{i=1}^{I} Y_{ai} \Delta_i / n_{a\cdot}$ =
      mean $\Delta$ for the items Student a marked correctly

   $\bar{\Delta}_{aR}$ = mean $\Delta$ for the items Student a reached
      $= \sum_{i=1}^{I} \Delta_i / I$ if Student a attempted all items

   $S_{\Delta aR}$ = standard deviation of the $\Delta$s for the items
      Student a reached
      $= \sqrt{\sum_{i=1}^{I} (\Delta_i - \bar{\Delta}_{aR})^2 / I}$ if the
      student attempted all items

   $n_{aR}$ = number of items Student a reached
      $= I$ if the student attempted all items

   $y$ = ordinate of the normal curve which divides the area under the curve
      into proportions $(n_{a\cdot} / n_{aR})$ and
      $[1 - (n_{a\cdot} / n_{aR})]$

   $\bar{Y}_a = X_a / I$ = mean item score for Person a

   $p_i = \sum_{a=1}^{N} Y_{ai} / N$ = fraction of the class answering Item i
      correctly

   $\bar{p} = \sum_{i=1}^{I} p_i / I$ = mean item difficulty

   $S_{aY} = \sqrt{\sum_{i=1}^{I} (Y_{ai} - \bar{Y}_a)^2 / I}$ = standard
      deviation of Person a's item scores

   $S_p = \sqrt{\sum_{i=1}^{I} (p_i - \bar{p})^2 / I}$ = standard deviation
      of the item difficulties

   $S_a$ = sum of the above-diagonal elements in a dominance matrix for
      Examinee a when items have been ordered on the basis of p-values from
      easiest to hardest (see Tatsuoka & Tatsuoka, 1982)

   $S$ = sum of all the matrix elements in the above mentioned dominance
      matrix (Tatsuoka & Tatsuoka, 1982)

   $P(\theta_a)_i$ = probability of Person a correctly answering Item i as
      this is predicted from the Rasch model
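
Two of these person-level indices are defined fully enough in the text to
sketch in code: the personal point-biserial [17], which is the product moment
correlation between one examinee's item scores and the item difficulties, and
the unweighted Rasch person fit statistic [19], which averages squared
deviations standardized by the variance of the expected responses. The
function names are hypothetical, standard deviations use the divisor I, and
returning `None` when the correlation is undefined is a choice made for this
sketch.

```python
import math


def personal_point_biserial(responses, difficulties):
    """Correlation between one examinee's 0/1 scores and the item p-values."""
    n_items = len(responses)
    mean_y = sum(responses) / n_items
    mean_p = sum(difficulties) / n_items
    cov = sum((y - mean_y) * (p - mean_p)
              for y, p in zip(responses, difficulties)) / n_items
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in responses) / n_items)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in difficulties) / n_items)
    if sd_y == 0 or sd_p == 0:
        return None  # all items right, all wrong, or equal difficulties
    return cov / (sd_y * sd_p)


def rasch_person_fit(responses, probabilities):
    """Mean squared standardized residual for one examinee.

    probabilities: model-predicted chance of a correct answer on each item
    at the examinee's ability; each must lie strictly between 0 and 1.
    """
    total = sum((y - p) ** 2 / (p * (1.0 - p))
                for y, p in zip(responses, probabilities))
    return total / len(responses)


# Items listed from easiest to hardest by class p-value.
p_values = [0.9, 0.7, 0.5, 0.3]
consistent = personal_point_biserial([1, 1, 0, 0], p_values)
aberrant = personal_point_biserial([0, 0, 1, 1], p_values)
```

An examinee who passes the easy items and misses the hard ones receives a
positive value, and the reversed pattern receives a negative value of the
same size.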

First, we note that Equations [15]-[19] all apply to dichotomously scored
items. Thus, to the extent that classroom tests are not dichotomously scored,
these indices will be inappropriate to include in an item analysis program.

Equations [15] and [18], the modified caution index (Harnisch & Linn,
1981) and the norm conformity index (Tatsuoka & Tatsuoka, 1982), respectively,
are based on the pattern of an examinee's responses to items when the items
have been arranged in order of difficulty from lowest to highest. If examinees
respond to the items in a manner consistent with their total test scores, the
zero/one elements of an examinee-by-item matrix should appear much as a
Guttman (1950) scalogram. That is, when examinees are arranged in order of
total test score and items are arranged in order of difficulty, examinees
with high test scores should exhibit an unbroken string of 1s, while examinees
with very low scores should have a long unbroken string of 0s. High scoring
examinees who break this pattern by responding incorrectly to very easy items,
or low scoring examinees who break it by answering difficult items correctly,
should be identified via [15] or [18] as performing inconsistently. Pupils so
identified by a statistical index can be brought to the attention of a teacher
who can seek an explanation.

In recent empirical studies of these two indices Harnisch and Linn (1981)
and Rudner (1983) found they correlated quite highly with each other when
they were computed on the same data. The modified caution index, however,
correlated less with the total score than did the norm conformity index
(Harnisch & Linn, 1981). When the purpose of using an index is to identify
persons with unusual response patterns, it is undesirable for that index to
be confounded (and hence correlated) with the total test score. Using the
correlation with the total test score as a criterion, the modified caution
index, [15], would be preferred over the norm conformity index for our
purposes.

Equations [16] and [17] are correlational indices: the personal biserial
(Donlon & Fischer, 1968), [16], and the personal point biserial, [17]. The
empirical studies by Harnisch and Linn (1981) and Rudner (1983) demonstrate
that these indices are highly correlated with each other when computed on
the same data. Harnisch and Linn also found that both indices were correlated
with the total score to an unacceptable degree and that sometimes the personal
point biserial had a nonlinear relationship to the total score.

Using simulated data, Rudner found that [16] and [17] identified
aberrant score patterns of examinees more frequently than [15] and [18] when
a 45-item "classroom test" was simulated; [15] and [18] seemed to identify
aberrant score patterns more frequently than [16] and [17] when a longer,
80-item, "commercial test" was simulated. Thus, although all four of the
indices are intercorrelated, they do not identify unusual score patterns with
equal effectiveness. In an unpublished study, Meyers (reported in Donlon and
Fischer, 1968) found that if test items are generally difficult for a group
of students, those students who had a better command of the subject (as a
result of having taken a course in the subject) tended to have somewhat lower
personal biserials.

Finally, Donlon and Fischer point out that the item difficulties (ETS
Δs) used in [16] should be derived from a sample independent of the one to
which the examinee whose personal biserial correlation is being computed
belongs; otherwise the personal biserials will tend to be higher because the
person is part of the sample.

Classroom tests are typically short: shorter than the 45 item test 
Rudner studied. Further, a teacher may not have available item difficulties 
from previous administrations of the items. Finally, the typical class size,
25-35, is a rather small sample and would surely accentuate chance
dependencies in the data. These considerations, along with the findings of
researchers such as Harnisch, Linn, and Rudner, lead us to conclude that [16]
and [17] should not be used in a classroom item analysis program.

Equation [19] is the unweighted person fit statistic from the Rasch
model. It would be used only when the teacher had access to items previously
calibrated by this model. This statistic compares a person's actual responses
to the items, $Y_{ai}$, with the person's average (expected) response,
$P_i(\theta_a)$, when the person's ability score, $\theta_a$, is known. This
squared deviation is standardized by dividing by the variance of the expected
responses for that ability level, summed over all items, and averaged. Rudner
(1983) indicates that [19] is more influenced by student responses to very
easy and very difficult items. His empirical investigation indicated that [19]
was not a very accurate identifier of aberrant response patterns for a
simulated classroom test of 45 items, but that [19] did function well with an
80-item simulated commercial test. Given these findings we believe that
insufficient evidence exists to include this index in an item analysis package
for use with short classroom tests of the type typically encountered in
schools. Therefore, we do not recommend including it in a typical item
analysis package.
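
The verbal description of [19] can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes the usual Rasch response function, takes the variance of a dichotomous response to be P(1 - P), and the function and parameter names are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def unweighted_person_fit(responses, theta, difficulties):
    """Average squared standardized residual for one person.

    responses    -- 0/1 item scores Y_ai for person a
    theta        -- the person's ability estimate (assumed known)
    difficulties -- previously calibrated Rasch item difficulties
    """
    if len(responses) != len(difficulties) or not responses:
        raise ValueError("need one response per calibrated item")
    total = 0.0
    for y, b in zip(responses, difficulties):
        p = rasch_probability(theta, b)
        # Squared deviation divided by the binomial variance p(1 - p).
        total += (y - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)
```

Because each residual is divided by P(1 - P), an unexpected response to a very easy or very difficult item contributes a large term, which is consistent with Rudner's observation about the index.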

Summary. Table 5 summarizes our recommendations for this section. These
recommendations are further reviewed and modified as a result of empirical
studies reported in a later section.

Insert Table 5 here 

4. Hierarchical ordering of the items on a test. A number of techniques
exist for constructing hierarchical orderings among items (e.g., Airasian
& Bart, 1973; Bart & Krus, 1973; Wise, 1981; Takeya, 1981). Tatsuoka and
Tatsuoka (1981) reviewed these techniques and found Takeya's to be most
appealing because it is "... mathematically elegant, and it has algebraic
relations with Loevinger's homogeneity [1948] index, Mokken's [1971] index H,
caution index (Sato, 1975), and Cliff's [1977] index $C_{t3}$" (p. 1).

Table 5. Recommended item statistics for helping a teacher improve and guide
instruction: Identifying unusual performance of a student on a test.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: modified caution index [15]
  B: --
  C: --
  D: personal biserial, r_perbis [16]; personal point biserial,
     r_perptbis [17]; norm conformity index, NCI [18]; Rasch unweighted
     person fit statistic [19]

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: --
  B: --
  C: --
  D: --

Note: See the text for definitions, formulas, and explanations.

Takeya's procedure (cited in Tatsuoka & Tatsuoka, 1981) defines an
order structure by determining the expected proportions of dominance
relationships between two items. This procedure is called item relations
structure analysis (IRSA). "The advantage of using IRSA is (according to
Takeya) that it enables us to see a cognitive aspect of a student's
performance on the items to a certain extent. Since it generates a digraph
representing the hierarchical structure of the items, it will--at the very
least--allow us to check the extent to which we have succeeded in constructing
problems that require a hierarchically specified set of skills for solving
them" (Tatsuoka & Tatsuoka, 1981, pp. 1-2). Tatsuoka and Tatsuoka used the
procedure to successfully construct a digraph of the structural relations
among a set of 24 items measuring knowledge of addition and subtraction of
fractions.

Although the results reported by Tatsuoka and Tatsuoka are encouraging,
more experience is needed with microcomputer computation in order to decide
on the practicality of the IRSA approach. Specifically, computer memory
requirements and speed of computation need to be determined. Therefore, we
recommend the technique be used only if the particular microcomputer to be
used is capable of handling the needed computations.

We note that an IRSA matrix for a particular set of items is subject to
errors of sampling students. In order to be meaningful, a hierarchical
arrangement of items should apply to a defined population of students rather
than only to the particular students at hand. Thus, there should be some
sample to sample stability of the IRSA matrix. We know very little about the
influence of student sampling on the fluctuations of the IRSA matrix. The
nature of the stability of the IRSA matrix should be a topic for further
study.

Summary. Our recommendation is summarized in Table 6. We will not
provide in this report empirical data to further clarify our recommendation.

Insert Table 6 here

5. Change in a class' performance on an item after instruction. A number
of item indices based on a pretest-posttest (or a two group) difference
have found their way into the literature on criterion-referenced testing.
Among these indices are proportion-based indices such as the (a) pretest-
posttest difference (Cox & Vargas, 1966), (b) uninstructed-instructed group
difference (Klein & Kosecoff, 1976), (c) individual gain (Roudabush, 1973)
and net gain (Kosecoff & Klein, 1974), (d) maximum possible gain (Brennan,
1974), (e) B index (Hsu, 1971; Brennan, 1972), and (f) internal sensitivity
(Kosecoff & Klein, 1974). There are correlational approaches as well: (a) item-
criterion group partial r (Darlington & Bishop, 1966), (b) item-total change
scores (Saupe, 1966), and (c) item-criterion group multiple correlation
(Darlington & Bishop, 1966). Most of these indices have been suggested as
types of discrimination indices for selecting items for criterion-referenced
tests in a manner similar to discrimination indices previously discussed in
the literature in connection with norm-referenced tests. An excellent summary
and review of these indices has been provided by Berk (1980).

Table 6. Recommended item statistics for helping a teacher improve and guide
instruction: Identifying a hierarchical ordering of the items on a test.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: --
  B: --
  C: IRSA
  D: --

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: --
  B: --
  C: --
  D: --

Note: See the text for definitions, formulas, and explanations.

Our purpose in this section is to consider item indices that provide a
teacher with useful information about how a class' performance on an item
changed as a result of instruction. We note, however, that we do not recommend
that the above indices be used for item selection. First, most pretest-posttest
types of indices are subject to rather large sampling fluctuations when used
with small groups. Second, teachers who blindly follow a statistical rule of
thumb for culling items on the basis of the value of a statistical index are
likely to be deceived: (a) items not showing change may still represent
important behaviors to be monitored, (b) danger exists in cutting from the
domain those items which a teacher has not taught well, and (c) items not
showing favorable pretest to posttest changes may represent erroneous
behavior that pupils have acquired as a result of instruction. Also, as
Ebel (1972) demonstrated, quirks in items themselves often lie behind pre-
to posttest performance anomalies.

Pretest-posttest indices can be useful to the teacher in identifying
those items on which pupils in a particular group perform in unexpected ways.
A teacher armed with such information may then decide whether the items
require revision or whether the fault lies with the instruction rather than
with the item.

Among the most useful of these indices for the specific purpose of
improving instruction are those of Cox and Vargas (1966), Roudabush (1973),
and Kosecoff and Klein (1974). In addition, the index proposed by Brennan
(1972) and Hsu (1971) has value in examining items when a meaningful passing
(mastery) score can be set. The latter requires special interpretive cautions,
however, because it cannot be computed when all or none of the students meet
the passing score and because the ideal index is zero. The Cox and Vargas
index--the difference in the proportions passing from pretest time to
posttest time--is a rough gauge of an item's functioning before and after
intervening instruction and is likely to be easily understood by teachers. The
Roudabush index--the proportion of pupils who answered an item incorrectly at
pretest time and correctly at posttest time--more clearly focuses on changes
in pupils' performance and it too can be understood by teachers. The Hsu and
Brennan index describes how well an item distinguishes between test passers
and nonpassers and, so, is not quite a measure of before and after instruction
change.
For purposes of giving a teacher information about changes in a
class' performance as a result of instruction, we would recommend the Cox and
Vargas (1966) index, Equation [20], and the Roudabush (1973) index, Equation
[21], which are defined by the formulas below.

$$ {}_iD_{\mathrm{postpre}} = {}_ip_{\mathrm{post}} - {}_ip_{\mathrm{pre}}, \qquad Y_{ai} = 0, 1 \qquad [20] $$

$$ {}_iD_{\mathrm{indgain}} = {}_ip_{01}, \qquad Y_{ai} = 0, 1 \qquad [21] $$

In the above formulas the notation is the same as that used in equations
[1]-[19] except that:

${}_ip_{\mathrm{post}}$ = proportion of the class answering Item i
                          correctly on the posttest
${}_ip_{\mathrm{pre}}$  = proportion of the class answering Item i
                          correctly on the pretest
${}_ip_{01}$            = proportion of the class answering Item i
                          correctly on the posttest but incorrectly
                          on the pretest
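
As an illustration of [20] and [21], the sketch below computes both indices for one dichotomously scored item from paired pretest and posttest scores. It is a minimal example under the definitions above; the function name and the dictionary keys are hypothetical, and the pupils are assumed to be listed in the same order on both occasions.

```python
def pre_post_indices(pre, post):
    """Return the pretest-posttest difference [20] and individual gain [21].

    pre, post -- lists of 0/1 scores on one item, one entry per pupil,
                 in the same pupil order on both occasions
    """
    if len(pre) != len(post) or not pre:
        raise ValueError("pre and post must be non-empty and equally long")
    n = len(pre)
    p_pre = sum(pre) / n
    p_post = sum(post) / n
    # Pupils who were wrong at pretest (0) and right at posttest (1).
    p_01 = sum(1 for a, b in zip(pre, post) if a == 0 and b == 1) / n
    return {"D_postpre": p_post - p_pre, "D_indgain": p_01}
```

Note that the two indices can differ for the same item: pupils who move from correct to incorrect lower the pretest-posttest difference but leave the individual gain unchanged.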

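Indices [20] and [21] are limited to use with dichotomously scored (0 or 1)
items. However, we can derive comparable versions of these formulas for items
that are scored in a graded or more continuous fashion:

$$ {}_iD^{*}_{\mathrm{postpre}} = \frac{{}_{\mathrm{post}}\bar{Y}_i - {}_{\mathrm{pre}}\bar{Y}_i}{l_i - m_i}, \qquad m_i \le Y_{ai} \le l_i \qquad [22] $$

$$ {}_iD^{*}_{\mathrm{indgain}} = {}_ip^{*}_{\mathrm{prepost}}, \qquad m_i \le Y_{ai} \le l_i \qquad [23] $$

$$ {}_iD^{**}_{\mathrm{indgain}} = {}_ip^{**}_{\mathrm{prepost}}, \qquad m_i \le Y_{ai} \le l_i \qquad [24] $$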
In formulas [22] through [24], we use the notation that follows:

${}_{\mathrm{post}}\bar{Y}_i$ = average score of the class on Item i when
                                it is administered at posttest time
${}_{\mathrm{pre}}\bar{Y}_i$  = average score of the class on Item i when
                                it is administered at pretest time
$l_i$ = maximum possible score on Item i
$m_i$ = lowest possible score on Item i

$$ {}_ip^{*}_{\mathrm{prepost}} = \sum_{b=1}^{C-1} \sum_{d=C}^{D} {}_ip_{bd} \qquad [23a] $$

    = proportion of the group failing at pretest and passing
      at posttest time
b = 1, 2, ..., B
    indexes the score categories on Item i at pretest time
d = 1, 2, ..., D
    indexes the score categories on Item i at posttest time
C = the index number of the minimum passing score on Item i;
    1 <= C <= D
${}_ip_{bd}$ = the proportion of examinees taking Item i that scored
    in the bth category at pretest time and in the dth
    category at posttest time

$$ {}_ip^{**}_{\mathrm{prepost}} = \sum_{b<d} {}_ip_{bd} \qquad [24a] $$

    = proportion of examinees taking Item i who scored higher
      at posttest time than at pretest time
Equation [22] converts the difference between the mean pretest and
posttest scores of the group to a percent of the maximum possible difference.
If simply the mean difference is desired then the numerator of [22] can be
reported. We would recommend, however, that the numerator not be reported to
teachers for purposes of guiding instruction. We would recommend instead
making available to teachers the actual pretest and posttest means.

If a nondichotomously scored item is assigned a passing score, then 
Equation [23] can be used to examine the shift in the percent of students 
passing from pretest to posttest time. This index would be comparable in 
interpretation to [21]. If no cutoff score or passing score is needed then
Equation [24] can be used. This equation computes the percent of students
in the class who improved their score on Item i from pretest to posttest
time. Figure 3 shows the data layout for [22], [23], and [24].
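
The three graded-item indices can be sketched from paired pupil scores rather than from the cross-tabulated proportions. This is a minimal illustration under stated assumptions: results are returned as proportions rather than percents, "passing" is taken to mean a score at or above the passing score, and all names are hypothetical.

```python
def graded_pre_post_indices(pre, post, min_score, max_score, passing_score=None):
    """Graded-item analogs of the pretest-posttest indices.

    pre, post     -- item scores per pupil, same pupil order on both occasions
    min_score     -- lowest possible item score (m_i)
    max_score     -- maximum possible item score (l_i)
    passing_score -- optional minimum passing score on the item
    """
    if len(pre) != len(post) or not pre:
        raise ValueError("pre and post must be non-empty and equally long")
    if max_score <= min_score:
        raise ValueError("max_score must exceed min_score")
    n = len(pre)
    mean_pre = sum(pre) / n
    mean_post = sum(post) / n
    result = {
        "mean_pre": mean_pre,
        "mean_post": mean_post,
        # [22]: mean change as a fraction of the maximum possible change.
        "D_star_postpre": (mean_post - mean_pre) / (max_score - min_score),
        # [24]: fraction of pupils scoring higher at posttest than pretest.
        "D_2star_indgain": sum(1 for b, d in zip(pre, post) if d > b) / n,
    }
    if passing_score is not None:
        # [23]: fraction failing at pretest and passing at posttest.
        result["D_star_indgain"] = sum(
            1 for b, d in zip(pre, post)
            if b < passing_score and d >= passing_score
        ) / n
    return result
```

The pretest and posttest means are returned alongside the indices because the text recommends making the actual means available to teachers.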

Insert Figure 3 

Summary. Our recommendations in this section are summarized in Table 7. 
We do not provide empirical data on these indices. 

Insert Table 7 here 



6. Summary of the seriousness of the types of errors pupils committed on an
item. In order to provide remedial instruction, a teacher needs to know the
types of errors and misconceptions a student has. An item analysis program
should provide some way of summarizing for each item the seriousness of the
errors committed by the students at hand. In this way, a teacher can focus
first on those items for which students' errors seem to be most in need of
remediation.

Figure 3. Data layout for Equations [22], [23], and [24]. Equation
          [22] uses the pre- and posttest means. Equation [23] is
          the sum of the proportions in the upper right quadrant.
          Equation [24] is the sum of the elements in the upper
          triangular portion of the matrix.

Table 7. Recommended item statistics for helping a teacher improve and guide
instruction: Identifying changes in a class' performance on an item after
instruction.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: D_postpre [20]; D_indgain [21]
  B: --
  C: --
  D: --

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: D*_postpre [22]; D*_indgain [23]
  B: posttest and pretest item means; D**_indgain [24]
  C: --
  D: --

Note: See the text for definitions, formulas, and explanations.

Tatsuoka (1981) developed a quantitative index of the seriousness of
errors of different types. Her approach is to use an analog of the norm
conformity index, [18], in which students' patterns of erroneous responses
to items are compared against an ideal (and correct) set of steps for
solving a problem or completing a task. The index requires (a) specifying
a task "tree" of procedural steps (similar to task performance networks
developed by cognitive psychologists such as Gagné (1968), Gregg (1976),
Greeno (1976), and Resnick (1976)), (b) classifying pupils' errors with
respect to types, and (c) analyzing each error type according to the
particular procedural steps that were violated. The Tatsuoka approach is
a useful one, as demonstrated by her research, but we believe it is too far
ahead of the current capabilities of the typical classroom teacher to be able
to develop the procedural steps and analyze them in the way necessary to use
her approach. We can conceive of a computer program to do some of this analysis
for the teacher once a procedural task network is specified and pupils'
responses are entered into the computer. We believe that this would be beyond
the practical capability of an item analysis program designed to serve the
daily needs of teachers. We would encourage research efforts along the lines
of Tatsuoka (1981), however.

What seems more in the realm of possibility is to ask a teacher to rate
the seriousness of pupil errors committed on each item and then to summarize
these for each item by displaying the frequency with which each degree of
seriousness occurs in the class and computing an average of these degrees of
seriousness. A teacher would be required to rate the degree of seriousness
of the errors committed by each pupil on each item. The indices to be
computed are:

$$ P(r_i) = (p_{1i},\ p_{2i},\ \ldots,\ p_{Ji}) \qquad [25] $$

$$ \bar{r}_i = \sum_{j=1}^{J} r_{ji}\, p_{ji} \qquad [26] $$

where

$r_{ji}$ = a teacher's rating of the seriousness of a
           pupil's error(s) on Item i
j = 1, 2, ..., J
    indexes the different ratings of a teacher on Item i
$p_{ji}$ = the proportion (reported as a percent) of the class who
           received an error rating $r_{ji}$ on Item i

The ratings, $r_{ji}$, can be assigned by the teacher for non-multiple-choice
items or by the microcomputer if a teacher specifies the seriousness, $r_{ji}$,
for each option of each multiple-choice item.
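
A minimal sketch of how a program might summarize the seriousness ratings for one item follows. It assumes the teacher's ratings are numeric, treats the class shares as proportions so that the weighted sum is an ordinary mean, and uses hypothetical names throughout.

```python
from collections import Counter


def seriousness_summary(ratings, scale):
    """Summarize teacher ratings of error seriousness for one item.

    ratings -- one numeric seriousness rating per pupil
    scale   -- the rating values the teacher may assign
    Returns the share of the class at each rating and the mean rating.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    unknown = set(ratings) - set(scale)
    if unknown:
        raise ValueError(f"ratings not on the scale: {sorted(unknown)}")
    n = len(ratings)
    counts = Counter(ratings)
    # Share of the class at each scale point, including unused points.
    distribution = {r: counts.get(r, 0) / n for r in scale}
    # Mean seriousness as the share-weighted sum of the rating values.
    mean_rating = sum(r * p for r, p in distribution.items())
    return distribution, mean_rating
```

For multiple-choice items, the same function could be applied after mapping each pupil's chosen option to the seriousness value the teacher assigned to that option.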

Insert Table 8 here 

7. Summary of the types of errors committed on an item. Instead of, or in
addition to, rating the seriousness of each error type, a teacher could
classify the errors on each item according to type, $t_{ji}$. The item analysis
program can summarize the percent of the class committing each type of error.
Thus,

$$ P(t_i) = (p(t_{1i}),\ p(t_{2i}),\ \ldots,\ p(t_{ji}),\ \ldots,\ p(t_{Ji})) \qquad [27] $$

where

j = 1, 2, ..., J
    indexes the different types of errors
$t_{ji}$ = the jth type of error on the ith item
$p(t_{ji})$ = the percent of the class committing the jth type of error on
    Item i.

Since $t_{ji}$ is likely to be nonmetric, the mean or average error type has no
meaning.

Insert Table 9 here 
Table 8. Recommended item statistics for helping a teacher improve and guide
instruction: Summarizing the seriousness of the types of errors pupils
committed on an item.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: [illegible]
  B: --
  C: --
  D: --

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: P(r_ji)
  B: --
  C: --
  D: --

Note: See the text for definitions, formulas, and explanations.
Table 9. Recommended item statistics for helping a teacher improve and guide
instruction: Summarizing the types of errors committed by students on an item.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: P(t_ji)
  B: --
  C: --
  D: --

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: --
  B: --
  C: --
  D: --

Note: See the text for definitions, formulas, and explanations.
Rewriting Individual Test Items

In this section we review a number of item statistics for conveying
the information in Table 1 in the category of rewriting or revising a
particular test item. We take into account a little more of the sampling
distributions of these statistics because if a teacher is revising an item,
the teacher expects the item to perform in a certain way in the future. We
consider sampling distributions in a more empirical way in a later section
of this report.

Table 10 provides a list and a brief description of several statistical
indices which have some potential value for providing information for teachers
seeking to use pupil data to revise items and which have some possibility of
being computed via a microcomputer. Below we describe each of these statistics
in more detail, pointing to the advantages and disadvantages of providing them
as part of a microcomputer item analysis program for classroom teachers.

Insert Table 10 here 

Table 10. Statistical item data potentially useful for helping a teacher
rewrite individual test items.

Type of information a teacher could use, followed by possible statistical
indices:

1. Extent of item-objective congruence
   - An individual teacher's rating of the degree to which an item
     matches a specified instructional objective.
   - Index of item-objective congruence (Rovinelli & Hambleton, 1977).
     The average rating of several judges as to whether item i matches
     objective k.

2. Extent of item-instructional event congruence
   - R*_ki: An individual teacher's rating of the degree to which an item
     corresponds to what the teacher taught in class or what the
     students were expected to study.
   - Mdn R*_ksi: The median rating of students as to whether the teacher
     or learning materials taught the content on which the item was
     based.

3. Vocabulary level of an item
   - Readability grade level of an item as determined from a
     readability formula.
   - Percent of words on a defined grade-level list.
   - Mean percent of students at a particular grade level passing
     a word-meaning test for the target words in the item.

4. Item difficulty level
   - The fraction or percent of the entire class passing a
     dichotomously scored item.
   - The mean item score of the entire class for an item that
     is scored in a graded or continuous way.
   - The mean item score, expressed as a percent of the
     maximum possible item score.
   - The difficulty parameter of an item calibrated via a
     latent trait model.

5. Item discrimination level
   - D: The net D discrimination index (Johnson, 1951). Difference
     between the percent passing in the upper and lower scoring
     pupils in the class.
   - r_bis: The biserial correlation between item score and total score.
   - a_i: The discrimination parameter of an item calibrated via a
     latent trait model.

6. Identification of poor distractors
   - D(p_Uji - p_Lji): For each option j of item i, the difference
     between the proportion of upper and lower scoring pupils choosing
     that option.
   - p_Lji = 0: The option j for which the fraction of lower scoring
     pupils choosing that option equals zero.

7. Identification of ambiguous alternatives
   - p_Uji = p_Uj'i > p_Uki: Two options, j and j', for which the same
     number or percent of upper scoring pupils choose these options and
     for which these percents are larger than the percents for other
     alternatives, p_Uki.

8. Identification of miskeyed items
   - max(p_Uji) > p_Uki: The option j may be miskeyed if the percent of
     the upper scoring group choosing it, max(p_Uji), is greater than
     the percent of the upper scoring group choosing the keyed option,
     p_Uki.

9. Identification of patterns of guessing among knowledgeable students
   - Frequency chi-square testing the goodness of fit of the observed
     proportion of the upper group choosing each option, p_Uji, to a
     uniform distribution.

1. Extent of item-objective congruence. If an item does not fit a teacher's
instructional objective it should be revised. Item-objective congruence can
be judged by a teacher or by a group of teachers. If an individual teacher
rates the item-objective congruence we designate it:

$R_{kji}$ = the rating of the degree of correspondence          [28]
            of Item i to Objective j by Teacher k

We suggest that a rating scale be developed for a teacher to use in which the
numbers on the rating scale have verbal anchors describing various degrees
of correspondence. An alternative procedure is to use some adaptation of
the Mager (1973) scheme for judging item-objective congruence, perhaps
quantifying the rating in a manner similar to the error seriousness measure
of Tatsuoka (1981). We suggest that this latter approach be further explored,
but recommend for the moment that [28] be used as a simple rating as described
above.

Reliability of rating item-objective congruence is gained by having
several teachers (or other content experts) judge each item. Hambleton
(1980) reviews several methods: (a) rating all possible item-objective
pairings (Rovinelli and Hambleton, 1977), (b) rating scale, and (c) matching
task. The latter consists of having each teacher attempt to match up the
test items on one list with the objectives on another. Items for which there
exists a lot of disagreement in matching among the teachers are revised. The
rating scale method consists of presenting teachers with a list of test items
already matched to objectives and asking teachers to judge the degree of
correspondence between each item and its corresponding objective. Items for
which the median rating is low and/or for which the variance of the ratings
is large are revised. We prefer the rating procedure to the matching
procedure because it seems to be a simpler task for teachers and rather
straightforward, and it asks teachers to judge the extent to which they
believe that items already sorted into categories by objectives have been
properly sorted. Disadvantages of the rating technique are (a) that someone
has to do an initial matching of the items and the objectives and (b) it does
not allow every item to be compared to every objective. (It sometimes happens
in practice that items will correspond better to objectives for which they
were not supposed to match. The rating procedure does not allow for this
anomaly to be detected.) When the rating procedure is used we recommend that
the median rating be the summary index.

$\mathrm{Mdn}\,R_{ji}$ = median rating of the correspondence of Item i    [29]
                         to Objective j by several teachers
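
The rating scale method can be sketched as below: for each item, compute the median and the variance of the teachers' ratings and flag items whose median is low or whose variance is large. The cutoffs, the use of the population variance, and all names are assumptions made for illustration; the text does not prescribe them.

```python
from statistics import median, pvariance


def summarize_congruence_ratings(ratings_by_item, min_median, max_variance):
    """Summarize several teachers' item-objective congruence ratings.

    ratings_by_item -- dict mapping an item label to its list of ratings
    min_median      -- items with a median rating below this are flagged
    max_variance    -- items with a rating variance above this are flagged
    Returns (summary, flagged), where summary maps each item to its
    (median, variance) pair and flagged lists the items to review.
    """
    summary = {}
    flagged = []
    for item, ratings in ratings_by_item.items():
        med = median(ratings)
        var = pvariance(ratings)
        summary[item] = (med, var)
        if med < min_median or var > max_variance:
            flagged.append(item)
    return summary, flagged
```

A teacher or test coordinator would choose the two cutoffs to suit the rating scale in use, for example a median below the scale midpoint.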
The Rovinelli and Hambleton (1977) procedure for judging item-domain
congruence seems to be the most thorough of the three procedures. It
requires that each teacher in the group judge each item against each objective
and rate each pairing as: +1 if the item definitely measures the objective,
0 if the teacher is undecided about the match, or -1 if the item definitely
does not measure the objective. A large number of comparisons are required.
If, for example, there are 10 objectives with 3 items per objective, there
are 300 (= (10 x 3) items x 10 objectives) pairs to judge. A disadvantage of
this technique is that because of the large number of comparisons to be made,
it is very time consuming. We prefer it, however, to the rating method if
time permits its use because it does allow all items and objectives to be
reviewed.

The method is implemented by collecting the ratings and entering them
into the formula below. The numerical value of the index obtained from the
formula for each test item does not depend on the number of objectives or
the number of teachers doing the rating. The index ranges in value from
-1.00 to +1.00. A value of 0.00 indicates that teachers cannot agree that
Item i matches Objective j; a value of +1.00 indicates that all teachers
agree that Item i matches Objective j; and -1.00 indicates that all teachers
agree that Item i does not match Objective j. The formula given by Rovinelli
and Hambleton is:

$$ I_{ji} = \frac{(J - 1)\sum_{k=1}^{K} R_{kji} \;-\; \sum_{k=1}^{K}\sum_{j' \ne j} R_{kj'i}}{2(J - 1)K} \qquad [30] $$

where

i = index number of the item
j = 1, 2, ..., J indexes the objectives
k = 1, 2, ..., K indexes the teachers
$R_{kji}$ = the rating of the kth teacher of the degree of correspondence
            of the ith item to the jth objective

When using [30] a cut-off value, $C_{ji}$, is specified. Any item for which
$I_{ji} < C_{ji}$ is revised to match the objective better.
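
The index can be computed for one item as in the sketch below. It follows the commonly cited form of the Rovinelli and Hambleton index, with each teacher's ratings of the item against all J objectives held in one row; the function name and data layout are hypothetical, and at least two objectives are assumed.

```python
def item_objective_congruence(ratings, target):
    """Index of item-objective congruence for one item.

    ratings -- ratings[k][j] is teacher k's rating (+1, 0, or -1) of the
               item against objective j
    target  -- position of the objective the item is meant to measure
    """
    n_teachers = len(ratings)
    n_objectives = len(ratings[0])
    if n_teachers == 0 or n_objectives < 2:
        raise ValueError("need at least one teacher and two objectives")
    # Sum of ratings on the intended objective across teachers.
    on_target = sum(row[target] for row in ratings)
    # Sum of ratings on all other objectives across teachers.
    off_target = sum(sum(row) - row[target] for row in ratings)
    numerator = (n_objectives - 1) * on_target - off_target
    return numerator / (2 * (n_objectives - 1) * n_teachers)
```

The index reaches +1 only when every teacher rates the item +1 on the intended objective and -1 on every other objective, so an item that seems to fit several objectives at once scores lower.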

Summary. Table 11 summarizes our recommendations here. We do not provide
further empirical data for these indices.

Insert Table 11 here 

2. Extent of item-instructional event congruence. An item should be revised
if it does not correspond to what the teacher taught or what the students were
assigned to study. We call this the item-instructional event congruence. The
teacher, the students, or both can be asked to judge the degree to which an
item corresponds to the instructional events of the classroom. We list two
indices below:

$R^{*}_{ki}$ = the kth teacher's rating of the degree              [31]
               of correspondence of Item i to what was
               taught to the students
$\mathrm{Mdn}\,R^{*}_{ksi}$ = median rating of the students in the kth    [32]
               teacher's class as to whether the material
               in Item i was taught by the teacher or
               covered by the materials
Table 11. Recommended item statistics to use to help a teacher revise
individual test items: Extent of item-objective congruence.

Columns:
  A. Basic (should be included in every item analysis program, if at all
     possible): routinely present to or interpret for a teacher on every
     test item.
  B. Basic (should be included in every item analysis program, if at all
     possible): make available to teachers upon their request only.
  C. Recommended: useful to include if (a) research shows teachers can use
     it and (b) the microcomputer has sufficient memory and speed.
  D. Not recommended for item analysis programs serving the above
     mentioned purposes.

Type of item scoring: Dichotomous (Y_ai = 0, 1)
  A: R_kji
  B: Mdn R_ji
  C: (see text)
  D: --

Type of item scoring: Graded or continuous (m_i <= Y_ai <= l_i)
  A: R_kji
  B: Mdn R_ji
  C: (see text)
  D: --

Note: See the text for definitions, formulas, and explanations.
In order to implement [31] and [32], rating scales need to be developed.
We suggest a 4- or 5-point rating scale that has verbal anchors describing
various degrees of overlap with instructional events. We also suggest that
different scales be developed for students and teachers. Items receiving
a low rating by the teacher may need to be revised (e.g., if the item came
from a set provided by the textbook publisher), or the teacher may have to
alter the instruction. If students do not perceive the item as related to
what they were taught or studied (i.e., the median rating is low), then a
teacher may need to discuss the item with the students before deciding
whether to revise it.
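
As a rough illustration of how [31] and [32] might be used together, the
following Python sketch computes the median student rating for one item and
reports which raters judged the item low. The 1-to-5 scale, the cutoff of 2,
and the function name `congruence_flags` are assumptions made for this
example; the text above does not fix any of them.

```python
from statistics import median


def congruence_flags(teacher_rating, student_ratings, cutoff=2):
    """Return the median student rating and who rated the item low.

    Assumes a 1-5 scale where 1 means "not taught" and 5 means
    "taught thoroughly"; the cutoff of 2 is illustrative only.
    """
    mdn = median(student_ratings)
    flags = []
    if teacher_rating <= cutoff:
        flags.append("teacher")
    if mdn <= cutoff:
        flags.append("students")
    return mdn, flags


# Hypothetical ratings for one item: the teacher rates it 4,
# but most students say the material was barely covered.
mdn, flags = congruence_flags(4, [1, 2, 2, 3, 5])
print(mdn, flags)
```

An item flagged only by the students corresponds to the case in which the
teacher may want to discuss the item with the class before revising it.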

Summary. Table 12 summarizes our recommendations.

Insert Table 12 here

3. Vocabulary level of an item. Several indices are suggested in Table 10
to judge the appropriateness of the wording of an item. Several readability
formulas exist and some could presumably be implemented via a microcomputer.
For each item, one applies a readability formula to obtain the item's
readability grade level, $r_i$. Readability formulas, however, require several
long passages to be analyzed (e.g., Fry (1979) or Bormuth (1969)). Even when
long passages are analyzed, some reading specialists question the validity of
these formulas (e.g., Instructional Objectives Exchange, 1980).

When readability formulas are discounted, about the only alternative
left is to use a word list of some type. Word lists attempt to identify the
pool of words that are appropriate to use on tests (and other materials)
designed for students at a particular grade level. Several approaches have
been used to develop word lists (IOX, 1980): (a) tabulating the words
Table 12. Recommended item statistics to use to help a teacher revise
individual items: Extent of item-instructional event congruence.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous.

Entries legible in the source: $R^{*}_{ki}$ and $\mathrm{Mdn}(R_{ksi})$.

Note: See the text for definitions, formulas, and explanations.

appearing at each grade level in published reading textbook series
(e.g., Taylor, et al., 197?), (b) listing the words at each grade level
that students know the meaning of (e.g., Dale and O'Rourke, 1976), (c)
tabulating the frequency with which words appear in general reading materials
(such as newspapers and magazines) (e.g., Carroll, Davies, and Richman,
1971; Sakiey and Fry, 1979), and (d) some combination of (a) through (c)
(e.g., IOX, 1980).

One could tabulate for each item the number of words in the item that
are on a particular list at a particular grade level and convert this to
a percent,

$r^{*}_{gi} = \dfrac{n_{gi}}{n_i}$   [33]

where

$n_{gi}$ = number of words in Item i that are found on the appropriate list
of eligible words for that grade level, g

$n_i$ = total number of words in Item i.

If this percent is less than some specified level (perhaps 1.00) the item
would be revised.
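
The tabulation behind [33] can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the tiny word list, the simple
tokenizing rule, and the function name `word_list_coverage` are hypothetical,
and a real grade-level list would be far longer.

```python
import re


def word_list_coverage(item_text, grade_word_list):
    """Return (share of item words on the list, sorted words not on it)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", item_text.lower())
    if not words:
        return 1.0, []
    on_list = sum(1 for w in words if w in grade_word_list)
    unlisted = sorted({w for w in words if w not in grade_word_list})
    return on_list / len(words), unlisted


# Hypothetical grade-level list and test item.
grade_3_words = {"which", "animal", "is", "a"}
r_star, unlisted = word_list_coverage("Which animal is a carnivore?", grade_3_words)
print(r_star, unlisted)
```

Returning the unlisted words alongside the ratio lets a program show the
teacher which words caused an item to fall below the chosen level.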

Some word lists were developed by asking students to check the words
they knew the meaning of or by giving students multiple-choice vocabulary
tests to determine their knowledge. The percent "passing" each word is
then listed. If a test item's words are checked against such a list for
a particular grade, and the percent of students in the norm group passing
each word is recorded, then one index for the vocabulary level of an item is

$\bar{p}_{gi} = \dfrac{1}{K_{gi}} \sum_{k=1}^{K_{gi}} p_{kgi}$   [34]

where

$p_{kgi}$ = percent of the norm group at Grade g knowing the meaning of
Word k in Item i

$K_{gi}$ = the number of words in Item i which were located on the
particular word list for Grade g

g = 1, 2, ..., 12 indexes the grade level

k = 1, 2, ..., $K_{gi}$ indexes the words found in the list

A disadvantage of [33] is that an item may contain words not on the
prescribed word list which are either (a) above the grade level intended for
the item or (b) suitable for the grade level. Thus, if $r^{*}_{gi}$ is less
than 1.00, no immediate course of action can be recommended except to check
the vocabulary. Disadvantages of [34] are that (a) $K_{gi}$ may be quite a bit
lower than $n_i$ and (b) the values of $p_{kgi}$ may be based on a norm group
that is not appropriate for the local pupils. A disadvantage of both [33] and
[34] is that it takes a long time to have a microcomputer check the vocabulary
level of an item since each word in the item has to be checked against a long
list of suitable or target words. Further, the test item itself would have to
be entered into the microcomputer (i.e., an item bank would be needed).

Of the two procedures for checking word lists, we recommend [33] since
its interpretation is likely to be somewhat easier for teachers. We suggest
that items be flagged for teachers and that the words not on the word list
be listed (or otherwise identified) for the teacher. Since a computer program
for doing this type of word processing and checking may not be feasible for a
small microcomputer, we recommend that it not be incorporated into a
typical item analysis package destined for a computer with small memory.

Summary. Our recommendations in this area are summarized in Table 13.

Insert Table 13 here 

4. Item difficulty level. The item difficulty level indices listed in
Table 10 are the same ones listed in Table 2 (except there are fewer indices
in Table 10). Our recommendations for item difficulty indices are listed in
Table 14. These are essentially the same as the recommendations in Table 3.
For purposes of identifying items that should be considered for revision, it
seems better to look at the overall difficulty of the item in the class.

We would not recommend using $b_i$, the item difficulty level of a latent
trait model, for identifying teacher-made items in need of revision. Latent
trait models (Lord, 1980; Rasch, 1960; Wright & Stone, 1979) offer another
conception of item difficulty: the point on a number line representing the
underlying latent ability at which the slope of the item response curve is
maximum. Large samples of pupil responses are needed to calibrate items
using latent trait models. While some large school districts have the
capacity to calibrate pools of items, most do not. Classroom microcomputers
are unlikely to have the kind of computing capacity needed to calibrate items.
Further, many classroom teachers would have difficulty interpreting $b_i$
because the concepts of latent trait and item response functions are not
commonly used. Additionally, there is no compelling educational or
psychological reason to believe that single objectives (or other instructional
domains) ought to be unidimensional (cf. Nitko, 1974), a requirement needed in
order for latent trait theory to be applied. Thus, although items
pre-calibrated by latent trait methods can

Table 13. Recommended item statistics to use to help a teacher rewrite or
revise items: Vocabulary level of an item.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous
($m_i \le Y_{ai} \le l_i$).

Entries legible in the source: two indices subscripted $gi$ and the
readability grade level $r_i$, in each row.

Note: See the text for definitions, formulas, and explanations.

be employed in the classroom, this method is unlikely to be of widespread
practical value to teachers in the revision of test items.

Summary. Table 14 summarizes our recommendations. We discuss sampling
fluctuations of these statistics in a later part of this paper.

Insert Table 14 here



5. Item discrimination level. The following are the definitions of the
statistics listed in Part 5 of Table 10 along with a few additional ones.

$D_i = p_{Ui} - p_{Li}$,  $Y_{ai} = 0, 1$   [35]

$D'_i = \dfrac{\bar{Y}_{Ui} - \bar{Y}_{Li}}{l_i - m_i}$,  $m_i \le Y_{ai} \le l_i$   [36]

${}_i r_{bis} = \dfrac{\bar{X}_{Ri} - \bar{X}_{Wi}}{S_X} \cdot \dfrac{p_i (1 - p_i)}{y_i}$,  $Y_{ai} = 0, 1$   [37]

${}_i r_{ptbis} = \dfrac{\bar{X}_{Ri} - \bar{X}_{Wi}}{S_X} \sqrt{p_i (1 - p_i)}$,  $Y_{ai} = 0, 1$   [38]

$a_i$ = discrimination parameter of a latent trait model,  $Y_{ai} = 0, 1$   [39]

In the above formulas:

$p_{Ui}$ = percent of the upper or higher scoring (on the total test) group
of students who answer Item i correctly

$p_{Li}$ = percent of the lower scoring (on the total test) group of students
who answer Item i correctly



Table 14. Recommended item statistics to use to help a teacher rewrite or
revise items: Item difficulty level.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous
($m_i \le Y_{ai} \le l_i$).

The cell entries are not legible in the source.

Note: See the text for definitions, formulas, and explanations.

$\bar{X}_{Ri}$ = mean total test score of the students who answered Item i
correctly

$\bar{X}_{Wi}$ = mean total test score of the students who answered Item i
incorrectly

$p_i$ = difficulty index of Item i defined by Equation [1]

$\bar{Y}_{Ui}$ = mean score on Item i of the upper or higher scoring (on the
total test) group of students

$\bar{Y}_{Li}$ = mean score on Item i of the lower scoring (on the total
test) group of students

$S_X$ = standard deviation of the total test scores of the students

$y_i$ = ordinate of the normal curve corresponding to the area equal to $p_i$

Traditionally, item discrimination refers to the extent to which an
item is able to differentiate among individuals with various levels of total
test performance. Most classroom test construction textbooks recommend
using the net D index, [35], for dichotomously scored items because it is
easily computed and understood by teachers. Originally proposed by A.
Pemberton Johnson (1951), this index has the advantage of describing the
fraction of net correct discriminations an item makes (Findley, 1956).
Here, a correct discrimination means that the item is answered correctly
by a high scoring examinee and incorrectly by a low scoring examinee. (The
index has been shown to have good properties when used to select items for
purposes of measuring relative achievement (White, Feldt, & Sabers, 1975).)
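
A minimal Python sketch of the net D computation follows. It assumes that the
upper and lower groups are simply the `group_size` highest and lowest scorers
on the total test; this section does not say how the groups are formed, so
that rule, the sample data, and the function name `net_d` are all illustrative
assumptions.

```python
def net_d(total_scores, item_scores, group_size):
    """Net D for one dichotomously scored (0/1) item."""
    # Rank students from highest to lowest total test score.
    order = sorted(range(len(total_scores)),
                   key=lambda s: total_scores[s], reverse=True)
    upper = order[:group_size]
    lower = order[-group_size:]
    p_upper = sum(item_scores[s] for s in upper) / group_size
    p_lower = sum(item_scores[s] for s in lower) / group_size
    return p_upper - p_lower


# Hypothetical class of six pupils.
totals = [9, 8, 7, 4, 3, 2]
item_a = [1, 1, 0, 1, 0, 0]   # upper group does better
item_b = [0, 0, 0, 1, 1, 0]   # lower group does better

d_a = net_d(totals, item_a, group_size=3)
d_b = net_d(totals, item_b, group_size=3)
print(round(d_a, 2), round(d_b, 2))
```

A negative value, as for the second item, marks an item on which low scoring
pupils outperform high scoring pupils.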
The use of [35], and any index described in this section for that
matter, for the purpose of domain-referenced classroom test item revision
is not immediately clear, but as we indicated earlier, if on a particular
item low scoring pupils do better than high scoring pupils (i.e., the item
discriminates negatively) the item needs to be studied more carefully before
the teacher decides to keep it intact or to revise it (e.g., Popham & Husek,
1969). It would be a sensible standard practice to revise items that have
discrimination indices below zero regardless of the purpose of the test,
unless there is some compelling reason not to.

Equation [36] was suggested by Whitney and Sabers (1970) as a counterpart
to [35] for items scored in a graded or continuous manner. It expresses
the mean difference between the upper and lower scoring groups as a percent
of the distance between the maximum possible score, $l_i$, and the minimum
possible score, $m_i$, for Item i. An alternate version of [36] which makes
its meaning clearer is

$D'_i = P_{Ui} - P_{Li}$   [36a]

where $P_{Ui}$ and $P_{Li}$ are computed for the upper and lower scoring
groups in a manner similar to Equation [2].
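
The agreement between [36] and [36a] can be checked numerically. The sketch
below assumes that Equation [2] (not shown in this section) rescales a mean
item score to the interval from 0 to 1 by subtracting the minimum possible
score and dividing by the score range; that reading, the sample scores, and
the function name are assumptions for illustration only.

```python
from statistics import mean


def graded_discrimination(upper_scores, lower_scores, min_score, max_score):
    """Return D' computed two ways: as in [36] and as in [36a]."""
    span = max_score - min_score
    # [36]: mean difference as a share of the possible score range.
    d_prime = (mean(upper_scores) - mean(lower_scores)) / span
    # [36a]: difference of rescaled group means (assumed form of [2]).
    p_upper = (mean(upper_scores) - min_score) / span
    p_lower = (mean(lower_scores) - min_score) / span
    return d_prime, p_upper - p_lower


# Hypothetical essay item scored 0 to 5.
d36, d36a = graded_discrimination([5, 4, 4], [2, 1, 3], 0, 5)
print(round(d36, 3), round(d36a, 3))
```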

Equations [37] and [38] are correlational indices of item discrimination.
For our purposes here, they are considered for the purpose of identifying
poorly discriminating items that would be identified and flagged for revision
by a teacher. Thorndike (1982) reviews the characteristics of the biserial
and point biserial correlations as item discrimination indices (particularly
as they relate to selecting items for standardized tests). The point biserial
is affected by the item difficulty, $p_i$, which curtails the possible range
of ${}_i r_{ptbis}$. Thus its value is confounded with item difficulty. The
biserial, ${}_i r_{bis}$, is not as confounded with item difficulty; but for
small samples and in skewed distributions, its numerical value can go beyond
the bounds of +1 and -1. The biserial correlation cannot be used in the
standard formulas for estimating total test statistics from item statistics.
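
The behavior described above can be seen in a small example. This Python
sketch uses the usual textbook forms of the point biserial and biserial
correlations, which may differ in notation from the report's own equations;
the data and the function name `item_correlations` are hypothetical. It
requires that some but not all pupils answer the item correctly.

```python
from statistics import NormalDist, mean, pstdev


def item_correlations(total_scores, item_scores):
    """Return (point biserial, biserial) for one 0/1 item."""
    right = [x for x, y in zip(total_scores, item_scores) if y == 1]
    wrong = [x for x, y in zip(total_scores, item_scores) if y == 0]
    p = len(right) / len(total_scores)
    s_x = pstdev(total_scores)
    diff = (mean(right) - mean(wrong)) / s_x
    # Ordinate of the unit normal curve at the point cutting off area p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    r_ptbis = diff * (p * (1 - p)) ** 0.5
    r_bis = diff * p * (1 - p) / y
    return r_ptbis, r_bis


# Hypothetical class of six: the item splits the class perfectly.
r_ptbis, r_bis = item_correlations([9, 8, 7, 4, 3, 2], [1, 1, 1, 0, 0, 0])
print(round(r_ptbis, 3), round(r_bis, 3))
```

With so few pupils, the biserial value in this example exceeds +1, which is
the kind of out-of-bounds result noted in the paragraph above.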

The disadvantages of the point biserial and biserial correlations would
argue against using them to identify items for a teacher to consider
rewriting. Net D, [35], seems to be a more straightforward statistic to
compute and interpret to teachers. We note, however, that net D is also
confounded with the item difficulty level, $p_i$ (Ebel, 1979).

We would not recommend using the latent trait parameter, $a_i$, for
identifying items for teachers to rewrite, for reasons similar to those
offered for not recommending $b_i$. It should be noted that if $a_i$ were to
be used, its use would be limited to precalibrated items of the two- and
three-parameter models, Equations [7] and [8], since in the one-parameter
model all a-values are equal.

With the exception of [36], all the above mentioned discrimination
indices are used only with dichotomously scored items. Graded or continuously
scored items can be analyzed with correlation analogues of the biserial and
point biserial correlations, namely, the polyserial and point polyserial
correlations (Olsson, Drasgow, & Dorans, 1982), respectively.

Summary. We summarize our recommendations for this section in Table 15.
We provide some empirical data on these statistics in a subsequent section.

Insert Table 15 here 

6. Identification of poor distractors. The distractors of a multiple-
choice item have a specific function: appear as plausible answers to those

Table 15. Recommended item statistics to use to help a teacher rewrite or
revise items: Identifying poorly or negatively discriminating items.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous.

Entries legible in the source: ${}_i r_{bis}$, ${}_i r_{ptbis}$, and $a_i$
(dichotomous row); ${}_i r_{ptpolyserial}$ and ${}_i r_{polyserial}$ (graded
or continuous row).

Note: See the text for definitions, formulas, and explanations.

students who do not have the degree of knowledge needed to choose the
correct answer to the item. Since it is in the lower group that we would
expect to find those lacking the requisite degree of knowledge, we would
expect the item data from the lower group to provide information about
poorly functioning distractors. One of two definitions of a properly
functioning distractor is often used: (a) a distractor is properly
functioning if more persons in the lower group than in the upper group choose
it and (b) a distractor is properly functioning if at least one person in the
lower group chooses it. Improperly functioning distractors are either revised,
replaced, or removed from the item. The following equations are consistent
with these two definitions:



$D(d_{ji}) = (d_{1i}, d_{2i}, \ldots, d_{hi})$,  $Y_{ai} = 0, 1$   [40]

where

$d_{ji} = p_{Lji} - p_{Uji}$   [40a]

j = 1, 2, ..., h indexes the options of an h-option multiple-choice item,
j ≠ correct answer

$p_{Uji}$ = the proportion of the students in the upper scoring group
choosing Distractor j of Item i

$p_{Lji}$ = the proportion of the students in the lower scoring group
choosing Distractor j of Item i

Equation [40] provides the set of differences between the proportions of
the lower and upper scoring groups choosing each distractor. If
$d_{ji} \le 0$, then Distractor j would be flagged for the teacher to consider
revising. Equation [41] considers only the lower group and looks for a
Distractor j for which $p_{Lji} = 0$. When this criterion is met,
Distractor j is flagged.

Of the two formulas, we prefer [40] since it will identify more
distractors for the teacher to review. In particular, $d_{ji}$ may be less
than or equal to zero even if some persons in the lower group choose
Distractor j. The fact that more upper than lower scoring pupils choose an
incorrect option should be brought to the teacher's attention.
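
The two flagging rules can be compared side by side in a short Python sketch.
It takes the difference for each distractor as the lower group proportion
minus the upper group proportion, which is the direction implied by the
flagging rule in the text; the option counts, group sizes, and the function
name `flag_distractors` are hypothetical.

```python
def flag_distractors(upper_counts, lower_counts, n_upper, n_lower, keyed):
    """Apply both distractor rules to every wrong option of one item."""
    result = {}
    for option in upper_counts:
        if option == keyed:
            continue  # only distractors are examined
        p_u = upper_counts[option] / n_upper
        p_l = lower_counts[option] / n_lower
        d = p_l - p_u
        result[option] = {
            "d": d,
            "flag_difference": d <= 0,   # rule based on [40]
            "flag_unchosen": p_l == 0,   # rule based on [41]
        }
    return result


# Hypothetical counts for a four-option item keyed "A", ten pupils per group.
upper = {"A": 8, "B": 1, "C": 1, "D": 0}
lower = {"A": 3, "B": 4, "C": 1, "D": 2}
flags = flag_distractors(upper, lower, 10, 10, keyed="A")
print(flags["C"])
```

Option C illustrates the point made above: it is chosen by someone in the
lower group, so the second rule passes it, yet the first rule still flags it.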

Some standardized test developers use the biserial or point biserial
correlation between total test score and choosing Distractor j as an index
of distractor quality. We do not recommend this for the analysis of teacher-
made test items for two reasons: (a) because of those reasons specified
previously in connection with the discrimination index and (b) because when
net D is used as a discrimination index, the data are set up in a way to make
[40] simple to compute.

We recommend also that $p_{Uji}$ and $p_{Lji}$ be made available to the
teacher upon request.

Summary. Table 16 summarizes our recommendations for this section.



Insert Table 16 here 



7. Identification of ambiguous alternatives. Here we seek to identify
multiple-choice items that contain ambiguous alternatives. Our definition
of ambiguous is similar to that of Sax (1980): Two alternatives of a
multiple-choice item, j and j', are said to be ambiguous if the same percent
of the upper scoring students choose j and j', and if this percent is the
largest percent among the alternatives. One expression for this relation is:

$p_{Uji} = p_{Uj'i} > p_{Uki}$ for all $k \ne j, j'$;  $Y_{ai} = 0, 1$   [42]

Table 16. Recommended item statistics to use to help a teacher rewrite or
revise items: Identifying poor distractors.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous
($m_i \le Y_{ai} \le l_i$).

Entries legible in the source (dichotomous row): $D(d_{ji})$; $p_{Uji}$ and
$p_{Lji}$; $p_{Lji} = 0$ (see text also).

Note: See the text for definitions, formulas, and explanations.



where

$(p_{U1i}, p_{U2i}, \ldots, p_{Uki}, \ldots, p_{Uhi})$ = the proportions of
the upper scoring group choosing each distractor

We know of no particular index other than [42] or some function of [42] that
is suitable for this purpose.
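
A check based on [42] might look like the following Python sketch. Exact ties
are required by default; the optional `tolerance` argument is an added
assumption for illustration, as are the sample proportions and the function
name.

```python
def ambiguous_alternatives(upper_props, tolerance=0.0):
    """Return the alternatives tied for the largest upper group share.

    An empty list means no ambiguity in the sense of [42].
    """
    top = max(upper_props.values())
    tied = sorted(opt for opt, p in upper_props.items()
                  if top - p <= tolerance)
    return tied if len(tied) >= 2 else []


# Hypothetical upper group proportions for two four-option items.
item_1 = {"A": 0.4, "B": 0.4, "C": 0.1, "D": 0.1}
item_2 = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
tied_1 = ambiguous_alternatives(item_1)
tied_2 = ambiguous_alternatives(item_2)
print(tied_1, tied_2)
```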



Insert Table 17 here 



8. Identification of miskeyed items. We define miskeying to have
occurred when the teacher inadvertently scores an incorrect alternative as
the correct answer to Item i for all students. Under this condition,
the largest percent of upper scoring students would choose the right answer
to Item i, but it would be marked wrong. This can be specified as follows:

$\max_j (p_{Uji}) > p_{Uki}$,  $Y_{ai} = 0, 1$   [43]

where

j = 1, 2, ..., h indexes the alternatives to Item i

$p_{Uji}$ = the proportion of the upper group choosing Alternative j of
Item i

$p_{Uki}$ = the proportion choosing the keyed alternative, k, of Item i

j ≠ k

As with Equation [42], we know of no other indices other than, perhaps,
simple transformations of [43].
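
The condition in [43] reduces to a single comparison, sketched here in Python.
The option labels, proportions, and function name are hypothetical.

```python
def possibly_miskeyed(upper_props, keyed):
    """True when some unkeyed alternative draws a larger share of the
    upper group than the keyed alternative does, as in [43]."""
    best_other = max(p for opt, p in upper_props.items() if opt != keyed)
    return best_other > upper_props[keyed]


# Hypothetical item: most upper group pupils chose B, but A was keyed.
upper_props = {"A": 0.1, "B": 0.7, "C": 0.2}
keyed_a = possibly_miskeyed(upper_props, keyed="A")
keyed_b = possibly_miskeyed(upper_props, keyed="B")
print(keyed_a, keyed_b)
```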



Insert Table 18 here 

Table 17. Recommended item statistics to use to help a teacher rewrite or
revise items: Identifying ambiguous distractors.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous.

Entries legible in the source (dichotomous row):
$p_{Uji} = p_{Uj'i} > p_{Uki}$.

Note: See the text for definitions, formulas, and explanations.

Table 18. Recommended item statistics to use to help a teacher rewrite or
revise items: Identifying miskeyed items.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous
($m_i \le Y_{ai} \le l_i$).

Entries legible in the source: $\max(p_{Uji}) > p_{Uki}$.

Note: See the text for definitions, formulas, and explanations.

9. Identification of patterns of guessing among knowledgeable students.
Here we want to use an index that would allow us to flag an item for a teacher
if the pattern of responses to it indicated that students who should know the
answer to the item are behaving in a random fashion. Two indices of possible
guessing behavior that are consistent with this purpose are the following.



$RU_i = \dfrac{U_{observed}}{U_{maximum}} = \dfrac{-\sum_{j} p_{ji} \log_2 (p_{ji})}{-\log_2 (1/(h_i - 1))}$   [44]

$D7_i = \dfrac{D_{observed}}{D_{maximum}}$   [45]

In [45] the observed discrepancy is accumulated over the $h_i$ alternatives
from the expected frequencies $n_U / h_i$ and the observed frequencies
$n_{Uji}$.

where

$p_{ji}$ = proportion of the entire class choosing Distractor j of Item i

$h_i$ = the number of alternatives for Item i

$n_{Uji}$ = the number of students in the upper scoring group who chose
Alternative j on Item i

$n_U$ = number of students in the upper scoring group

Equation [44] is known as the relative uncertainty index and was suggested
by Pike and Flaugher (1970). This index takes on values between 0.00 and 1.00
and reflects the extent to which examinees respond in a manner that would
produce a flat (uniform) distribution of $p_{ji}$ values over the distractors
(wrong answers) of a multiple-choice item. The $RU_i$ index has been used
successfully to study guessing patterns in several standardized tests: the
PSAT (Pike & Flaugher, 1970), the GRE (Pike, 1980), the 3R's (Khampalikit,
1982), and the Joint College Entrance Examination of Taiwan (Hsu &
Khampalikit, 1980). It was also used to study a college level classroom test
(Hsu & tiou, 1982) but with less success. A major problem that occurs with
$RU_i$ is that $p_{ji}$ cannot be equal to zero when computing the log. Thus,
if all students can eliminate one distractor, $RU_i$ cannot be computed. This
is particularly problematic for classroom tests. A second difficulty with
using $RU_i$ as stated in [44] is that it considers all students, not just
upper group students. We would expect the lower scoring students to guess on
classroom tests and teachers may well encourage them to do so. It is in the
upper scoring group of students that we believe we should find patterns
indicating that they are responding in a more informed manner. Pike and
Flaugher do suggest that [44] be computed for various subgroups, but again as
the number of responses to each alternative becomes fewer, the computation and
interpretation of [44] become problematic.
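
The following Python sketch shows one way the relative uncertainty index could
be computed and where the zero-proportion problem arises. It assumes the
proportions are taken among the pupils who chose a distractor, so that they
sum to one and the index stays between 0 and 1; that normalization, the
decision to return `None` when a distractor is unchosen, and the function name
are assumptions of this example, not details given in the text.

```python
from math import log2


def relative_uncertainty(distractor_counts):
    """Relative uncertainty over the distractors of one item.

    Returns None when the index cannot be computed, for example
    when some distractor was chosen by no one.
    """
    n_distractors = len(distractor_counts)
    total = sum(distractor_counts)
    if n_distractors < 2 or total == 0 or min(distractor_counts) == 0:
        return None
    props = [c / total for c in distractor_counts]
    observed = -sum(p * log2(p) for p in props)
    maximum = log2(n_distractors)  # equals -log2(1 / (h - 1))
    return observed / maximum


# Hypothetical counts of wrong answers on three four-option items.
ru_flat = relative_uncertainty([5, 5, 5])      # evenly spread errors
ru_peaked = relative_uncertainty([13, 1, 1])   # one dominant error
ru_missing = relative_uncertainty([6, 3, 0])   # one distractor unchosen
print(ru_flat, ru_peaked, ru_missing)
```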

Equation [45] is an adaptation of the Huynh (1983) index defined by
Equation [14]. Although we found [14] not to be practical for identifying
items exhibiting teacher-pupil discrepancies, the adaptation, [45], seems
useful for the purposes of this section. We substitute into [14]
the expected frequency of choices for each alternative if the upper group
responded randomly ($= n_U / h_i$). The value of $D7_i$ is near zero if the
students do respond randomly and is one if students do not respond randomly.
Note that unlike the relative uncertainty index, $D7_i$ considers (a) only the
upper scoring group and (b) all alternatives, not just the distractors.
We believe these characteristics to be advantages since (a) it is when the
upper group begins to guess randomly that a teacher's attention should be
drawn to the item and (b) if the upper group is randomly guessing their
guesses will include the possibility of choosing the correct alternative
as well as the distractors.

Still a third index could be computed (this is given in Table 10), the
frequency chi-square for testing the goodness of fit to a uniform distribution
of the response pattern of the upper group to all alternatives. This equation
is

$SSQ_i = \sum_{j=1}^{h_i} \dfrac{\left( n_{Uji} - n_U / h_i \right)^2}{n_U / h_i}$   [46]

where $n_{Uji}$, $n_U$, and $h_i$ are as defined for [44] and [45]. We believe
[46] to be too variable with small $n_{Uji}$ so that if a strictly statistical
chi-square criterion is used to decide whether the pattern of responses is
uniform the user would be subject to committing a Type II error with high
probability.
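
For reference, the goodness-of-fit statistic in [46] is simple to compute, as
in this Python sketch. The counts are hypothetical, and the sketch stops short
of comparing the value with a chi-square critical value.

```python
def uniform_chi_square(upper_counts):
    """Chi-square statistic for fit of the upper group's choices,
    over all alternatives of one item, to a uniform distribution."""
    n_upper = sum(upper_counts)
    expected = n_upper / len(upper_counts)
    return sum((n - expected) ** 2 / expected for n in upper_counts)


# Hypothetical upper groups of ten pupils on four-option items.
ssq_spread = uniform_chi_square([3, 3, 2, 2])    # close to uniform
ssq_agreed = uniform_chi_square([10, 0, 0, 0])   # everyone picks one option
print(ssq_spread, ssq_agreed)
```

With only ten pupils, moving a single response changes the first value
noticeably, which is consistent with the concern about small frequencies
raised above.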

Summary. Table 19 summarizes our recommendations that Equation [45]
be used to identify items where guessing may be occurring among the upper
group.

Insert Table 19 here

Selecting Items to Put on a Test

We presented the rationale that will guide our review of item statistics
for purposes of improving the total test score properties on pages 13-16.
In this section we review several statistical indices and make recommendations
concerning which should be included in a microcomputer item analysis program
for classroom testing.

We assume in this section that the indices will be used to select items,

Table 19. Recommended item statistics to use to help a teacher rewrite or
revise items: Identifying patterns of guessing among knowledgeable students.

Columns: Basic (should be included in every item analysis program, if at all
possible), subdivided into (1) routinely present to or interpret for a teacher
on every test item and (2) make available to teachers upon their request only;
Recommended (useful to include if (a) research shows teachers can use and
(b) microcomputer has sufficient memory and speed); Not recommended for item
analysis programs serving the above mentioned purposes.

Rows: Dichotomous ($Y_{ai} = 0, 1$); Graded or continuous.

Entries legible in the source (dichotomous row): $SSQ_i$; the remaining
entries are not legible.

Note: See the text for definitions, formulas, and explanations.

rather than for trying to improve instruction by reviewing items or trying
to obtain information about which items need to be revised. We are assuming
also that the items have been revised and tried out so that the statistics
(or the data for the statistics) are available. Thus, the items exist in
some pool or bank and the item analysis program will compute and
interpret certain statistical indices associated with each item.

We assume that a teacher will use different classroom tests for different 
purposes as we outlined on pages 13 and 14. Among other things this means 
that item statistics will need to be used in combination in order to select 
items to put on any particular test. The reader is urged to keep this in 
mind when reading below, because we initially focus on each category of item 
statistic separately. 

Table 20 lists several item analysis statistics which seem on the sur- 
face to be suitable for our purposes here. Below we will review them. 



Insert Table 20 here 



1. Item discrimination level. The three item discrimination indices in
Table 20 have been defined and discussed previously (Equations [35] - [39])
for other purposes. Here we note that it seems most appropriate to use net
D as specified in [35] and [36] for most classroom tests in a way that we
will describe shortly. We do not recommend the correlational indices
${}_i r_{ptbis}$ and ${}_i r_{bis}$, or their polyserial counterparts.

We do recommend, however, that if the teacher has access to an item
bank containing items calibrated on a two-parameter or three-parameter latent
trait model (Equations [7] or [8]) that the item discrimination index, $a_i$,
be used. It would not be possible for a small microcomputer to compute $a_i$
for classroom tests, but if $a_i$ were already available, it is possible to
create

Table 20. Statistical item data potentially useful for helping a teacher
select items to put on a classroom test.

Type of information a teacher could use, followed by possible statistical
indices:

1. Item discrimination level
     $D_i$: Same as Table 10
     ${}_i r_{bis}$: Same as Table 10
     $a_i$: Same as Table 10

2. Item difficulty level
     $p_i$: Same as Table 10
     $P_i$: Same as Table 10
     $b_i$: Same as Table 10

3. Relation of the item to test blueprint and/or domain specification
     Congruence indices: Same as Table 10
     $ID_{kli}$: A code for the location of the item in a content by
     objectives grid (i.e., test blueprint)

4. Estimated total test statistics
     $\bar{X}$: Estimated mean of the total test scores when the items
     selected so far are used.
     $S_X$: Estimated standard deviation of the total test scores when the
     items selected so far are used.
     KR20: Estimated Kuder-Richardson test reliability when the items
     selected so far are used.


a program that would help teachers choose items. This program should use
both $a_i$ and $b_i$ (i.e., the latent trait item difficulty index) to help a
teacher design a test for measuring relative achievement using the item
information function.

Our recommendations in this area are summarized in Table 21.

Insert Table 21 here

2. Item difficulty level. Item difficulty indices have been discussed
previously and our recommendations for other purposes summarized in Tables 3
and 14. Our recommendations for item difficulty indices in this section are
the same as those for Table 14, except that we would recommend that the
latent trait parameter $b_i$ be incorporated into the item analysis program
in the manner suggested above for using $a_i$, the latent trait item
discrimination index.

Insert Table 22 here 

3. Relation of item to test blueprint and/or domain specification.
Our recommendations for these congruence indices are listed in Table 11
and as Equations [28] through [30]. We note here that in addition to a rating
of how well a test item matches an objective, it is necessary to identify 
the content topic and level of understanding covered by each test item. 
This is not a statistic per se, but it is an index number that helps the 
teacher to identify the item and to check a test's balance of coverage. 






Table 21. Recommended item statistics to use to help teachers select items to put on a classroom test: 
Item discrimination indices 



Type of item scoring

Basic: Should be included in every item analysis program, if at all possible.
    Routinely present to or interpret for a teacher on every test item.
    Make available to teachers upon their request only.

Recommended: Useful to include if (a) research shows teachers can use them
and (b) the microcomputer has sufficient memory and speed.

Not recommended for item analysis programs serving the above mentioned
purposes.


Dichotomous 
"at ' »•» 








i r bis 
i r ptbis 


Graded or 
continuous 

h : hi i ¥ 




• 




r 

i ptpolyserial 
i polyserial 



Note: See the text for definitions, formulas, and explanations related to Equations [35] - [39].
Table 22. Recommended item statistics to use to help teachers select items to put on a classroom test:
Item difficulty indices



Type of item scoring

Basic: Should be included in every item analysis program, if at all possible.
    Routinely present to or interpret for a teacher on every test item.
    Make available to teachers upon their request only.

Recommended: Useful to include if (a) research shows teachers can use them
and (b) the microcomputer has sufficient memory and speed.

Not recommended for item analysis programs serving the above mentioned
purposes.


Dichotomous 

»a ' 


p t 








Graded or 
continuous 


p t 


\ 







Note: See the text for definitions, formulas, and explanations related to Equations [1] - [3] and
[6] - [8].

We call this index number:

ID_{kli} = index number of the ith item                          [47]
           in relation to the lth topic
           in the unit and the kth level
           of understanding

Insert Table 23 here 

4. Using combinations of indices to select items for classroom tests.
The item statistics identified above cannot be used independently, but must
be used in combination. The particular combinations to use depend on the
type of decisions for which the test is to be used and, in particular, on
whether the test is to be used to measure absolute or relative achievement
and whether partial or complete ordering is desired. If latent trait
parameters are available and the measurement of relative achievement is
desired, then the microcomputer program can use a_i, b_i, and c_i in connection
with the item information function to help design a test that will provide
the most information possible at certain ability levels. Lord (1980)
provides guidelines for this process.
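
As an illustration of how a program might rank precalibrated items by the information they provide at a target ability level, the following is a minimal sketch. It assumes the three-parameter logistic model with the usual scaling constant D = 1.7 and the standard form of the item information function for that model; the function names, the tuple layout of the item bank, and the example parameter values are hypothetical and are not taken from this report.

```python
import math


def p_correct(theta, a, b, c=0.0, scale=1.7):
    """Probability of a correct response under the (assumed) 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def item_information(theta, a, b, c=0.0, scale=1.7):
    """Item information at ability theta; reduces to the 2PL form when c = 0."""
    p = p_correct(theta, a, b, c, scale)
    q = 1.0 - p
    return (scale * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def select_items(bank, theta_target, n_items):
    """Return the names of the n_items most informative items at theta_target.

    bank is a list of (name, a, b, c) tuples; this layout is illustrative only.
    """
    ranked = sorted(
        bank,
        key=lambda item: item_information(theta_target, item[1], item[2], item[3]),
        reverse=True,
    )
    return [item[0] for item in ranked[:n_items]]


# Hypothetical item bank: (name, a, b, c)
bank = [
    ("easy", 1.0, -1.5, 0.0),
    ("medium", 1.0, 0.0, 0.0),
    ("hard", 1.0, 1.5, 0.0),
    ("weak", 0.4, 0.0, 0.0),
]
chosen = select_items(bank, theta_target=0.0, n_items=2)
print(chosen)
```

In this sketch the item whose difficulty is closest to the target ability and whose discrimination is highest is selected first; a real program would also need to respect the test blueprint while selecting.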

But most teachers will not have access to items precalibrated by latent
trait methods. Rather, they will have items for which only item difficulty,
item discrimination, and some indication of what the item is measuring are
available. We recommend that the item analysis program incorporate some
rules of thumb that will help the teacher to select items using the latter
statistical indices when the test purpose is specified. Table 24 summarizes
the rules of thumb we recommend. The rules of thumb in this table are
consistent with modern concepts of item analysis and test design as these have been



Table 23. Recommended item statistics to use to help teachers select items to put on a classroom test:
Relation of the item to the test blueprint and/or domain specification

Type of item scoring

Basic: Should be included in every item analysis program, if at all possible.
    Routinely present to or interpret for a teacher on every test item.
    Make available to teachers upon their request only.

Recommended: Useful to include if (a) research shows teachers can use them
and (b) the microcomputer has sufficient memory and speed.

Not recommended for item analysis programs serving the above mentioned
purposes.


Dichotomous
(Y_ai = 0, 1)


\n 
l \n 


Mdn Rji 






Graded or
continuous

< Y < 15 
- at - V 


hit 
ID ki ± 


Mdn Rji 

Hi 







Note: See the text for definitions, formulas, and explanations related to Equations [28] - [30] and
[47].

articulated by Lord (1953, 1980) and Henrysson (1971).



Insert Table 24 here 



5. Estimated total test statistics. An item analysis program that is
to be used to help teachers select items should provide estimates of the
properties of the total test scores based on the selected items. The item
statistics recommended in Tables 21 and 22 for dichotomous items can be used
to estimate the test mean, standard deviation, and Kuder-Richardson formula
20 reliability as follows:



\bar{X} = \sum_{i=1}^{I} p_i                                      [48]

        = estimated mean of the test composed
          of I items



_<v». 
1 SD X 



I 

i^=3^ 



Y ai = °» 1 



[49] 



,KR20 



ai 



0, 1 



[50] 



where

i = 1, 2, ..., I indexes the items selected for the test
D_i = the net D discrimination index for dichotomously scored Item i
p_i = the difficulty index for dichotomously scored Item i
Table 24. Rules of thumb for using item analysis data to build
classroom tests.

Columns:
  (A) Relative achievement is the focus: Complete ordering
  (B) Relative achievement is the focus: Partial ordering (two groups)
  (C) Absolute achievement is the focus

General concerns
  (A) Ranking all the pupils in terms of their relative attainment in a
      subject area.
  (B) Dividing pupils into two groups on the basis of their relative
      attainment. Pupils within each group will be treated alike.
  (C) Assess the absolute status (achievement) of the pupil with respect
      to a well-defined domain of instructionally relevant tasks.

Specific focus of test
  (A) Seek to accurately describe differences in relative achievement
      between individual pupils.
  (B) Seek to accurately classify persons into two categories.
  (C) Seek to accurately estimate the percentage of the domain each
      pupil can perform successfully.

Attention to the test's blueprint
  (A) Be sure that items cover all important topics and objectives
      within the blueprint.
  (B) Be sure that items cover all important topics and objectives
      within the blueprint.
  (C) Be sure items are a representative, random sample from the defined
      domain which the blueprint operationalizes.

How the difficulty index (p) is used
  (A) Within each topical area of the blueprint, select those items with:
      (1) p between 0.16 and 0.84 if performance on the test represents
          a single ability.
      (2) p between 0.40 and 0.60 if performance on the test represents
          several different abilities.
      Note: Items should be easier than described above if guessing is
      a factor.
  (B) Within each topical area of the blueprint, select those items with
      p-values slightly larger than the percentage of persons to be
      classified in the upper group [e.g., if the class is to be divided
      in half (0.50), then items with p-values of about 0.60 should be
      selected; if the division is lower 75% vs. upper 25%, items should
      have p = 0.35 (approximately)].
      Note: The above suggestion assumes the test measures a single
      ability.
  (C) Don't select items on the basis of their p-values, but study each
      p to see if it is signaling a poorly written item.

How the discrimination index (D) is used
  (A) Within each topical area of the blueprint, select items with D
      greater than or equal to +0.30.
  (B) Within each topical area of the blueprint, select items with D
      greater than or equal to +0.30.
  (C) All items should have D greater than or equal to 0.00. Unless
      there is a rational explanation to the contrary, revise those
      items not possessing this property.

Source: Nitko (1983, pg. 301)
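
The rules of thumb in Table 24 could be encoded in a program roughly as follows. This is only a sketch under stated assumptions: the function name, the purpose labels, and the `tolerance` parameter are hypothetical, and the 0.10 margin for the partial-ordering case is inferred from the table's two examples (0.50 to about 0.60, and 0.25 to about 0.35). The adjustment for guessing mentioned in the table is not implemented.

```python
def review_item(p, d, purpose, single_ability=True,
                upper_fraction=0.50, tolerance=0.10):
    """Return a list of review flags for one item, following Table 24.

    p: item difficulty index; d: net D discrimination index.
    purpose: "complete_ordering", "partial_ordering", or "absolute"
             (labels chosen for this sketch).
    """
    flags = []
    if purpose == "absolute":
        # p is not a selection criterion here; only D below 0.00 is flagged.
        if d < 0.0:
            flags.append("D below 0.00: revise unless there is a rational explanation")
        return flags
    if purpose not in ("complete_ordering", "partial_ordering"):
        raise ValueError("unknown purpose: " + str(purpose))

    if d < 0.30:
        flags.append("D below +0.30")

    if purpose == "complete_ordering":
        low, high = (0.16, 0.84) if single_ability else (0.40, 0.60)
        if not (low <= p <= high):
            flags.append(f"p outside {low:.2f}-{high:.2f}")
    else:
        # "Slightly larger" than the upper-group fraction; the 0.10 margin
        # and the tolerance are assumptions made for illustration.
        target = upper_fraction + 0.10
        if abs(p - target) > tolerance:
            flags.append(f"p not near {target:.2f}")
    return flags


print(review_item(0.90, 0.45, "complete_ordering"))
print(review_item(0.35, 0.40, "partial_ordering", upper_fraction=0.25))
print(review_item(0.95, -0.05, "absolute"))
```

Returning a list of flags rather than a yes/no decision keeps the teacher in control of the final selection, which matches the advisory role the rules of thumb are meant to play.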
Expression [49] was derived by Ebel (1967) under the assumption that the
test scores are normally distributed. The sampling distribution and standard
errors of these estimates are unknown, and the effect of non-normality on
equations [49] and [50] is unknown. Expression [48] does not depend on
distributional assumptions.

If items are scored continuously, then [48] becomes

\bar{X} = \sum_{i=1}^{I} \bar{Y}_i                                [51]

where \bar{Y}_i is the mean score on Item i.



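The following expressions relate item scores (either continuous or
dichotomous) to total test score standard deviation and reliability:

SD_X = \sum_{i=1}^{I} SD_{Y_{ai}} \, r_{Y_{ai}X_a}                [52]

KR20 = \frac{I}{I-1} \left[ 1 - \frac{\sum_{i=1}^{I} (SD_{Y_{ai}})^2}{\left( \sum_{i=1}^{I} SD_{Y_{ai}} \, r_{Y_{ai}X_a} \right)^2} \right]        [53]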



In the above formulas r_{Y_{ai}X_a} is the Pearson product moment correlation
between the item scores and the total test score on the tryout edition of the
test. If Y_{ai} is dichotomous, then this correlation becomes the point
biserial correlation, {}_i r_{ptbis}.

Equations [52] and [53] are useful and may provide better estimates than
[49] and [50]. It is recommended that {}_i r_{ptbis} be corrected so that it
estimates the correlation of each item with the common true score measured
by the whole set of items, as suggested by Henrysson (1971).

If r_{Y_{ai}X_a} is unknown, Thorndike (1982) suggests estimating its mean
value, \bar{r}_{Y_{ai}X_a}, from past experience and substituting this estimate
in [52] and [53]. Further, if the items are dichotomous and the average
difficulty of the items on the test, \bar{p}, can be estimated, equations
[51]-[53] can be simplified as follows:

\bar{X} = I \bar{p}                                               [54]

SD_X = I \, \bar{r}_{Y_{ai}X_a} \sqrt{\bar{p}(1 - \bar{p})}       [55]

KR20 = \frac{I}{I-1} \left[ 1 - \frac{1}{I \, \bar{r}_{Y_{ai}X_a}^{2}} \right]        [56]


It should be noted that [55] overestimates the standard deviation 
(Thorndike, 1982). 
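
A program could compute these total-test estimates along the following lines. The sketch assumes the standard identities that the mean of the total score is the sum of the item means, that the total-score standard deviation equals the sum over items of (item standard deviation times item-total correlation), and that reliability takes the coefficient alpha / KR20 form; the simplified version assumes every item shares the average difficulty and average item-total correlation. Function and argument names are hypothetical.

```python
import math


def estimate_test_statistics(item_means, item_sds, item_total_corrs):
    """Estimate test mean, SD, and reliability from per-item statistics."""
    n_items = len(item_means)
    mean = sum(item_means)
    sd = sum(s * r for s, r in zip(item_sds, item_total_corrs))
    sum_item_var = sum(s * s for s in item_sds)
    reliability = (n_items / (n_items - 1)) * (1.0 - sum_item_var / sd ** 2)
    return mean, sd, reliability


def estimate_from_averages(n_items, mean_p, mean_r):
    """Simplified estimates for dichotomous items using average p and average r."""
    mean = n_items * mean_p
    sd = n_items * mean_r * math.sqrt(mean_p * (1.0 - mean_p))
    reliability = (n_items / (n_items - 1)) * (1.0 - 1.0 / (n_items * mean_r ** 2))
    return mean, sd, reliability


# Hypothetical 20-item test: every item has p = 0.5 and item-total r = 0.4.
n = 20
detailed = estimate_test_statistics([0.5] * n, [0.5] * n, [0.4] * n)
simplified = estimate_from_averages(n, 0.5, 0.4)
print(detailed)
print(simplified)
```

When all items are identical, as in this example, the two routes agree; with items of varying difficulty the simplified standard deviation will tend to run higher, in line with the overestimation noted above.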

Summary. Our recommendations for this section are summarized in Table 25.



Insert Table 25 here 



Data Concerning the Sampling
Fluctuations in Selected Item Statistics

In an effort to obtain more information about the sampling fluctuations
of some of the item statistics recommended in this report, we undertook a
sampling study with the assistance of Dr. Huynh Huynh of the University of 
South Carolina. We sought to simulate the fluctuations in students that 
might occur from year to year in a teacher's class. To do this we used the 
item response data bank available at the University of South Carolina in 
connection with technical research conducted by the Mastery Testing Project 
(NIE-G-78-0087) and the Technical Works of Basic Skills Assessment Programs 
Table 25. Recommended item statistics to help a teacher select items for a test: Estimating total test
properties.



Type of item 
scor ing : 



Dichotoraous 

fi a ■ CD 



Graded or 
continuous 

< l i- < ' ! .i. <1 t' 



Basic: Should be included in every item 
analysis program, if at all possible; 



Routinely present to 
or interpret for a 
teacher on every item 
selection situation. 



3A 



f 



-V 



Make available to 
teachers upon their 
request only. 



Recommended: Useful to include if (a) research shows teachers can use them
and (b) the microcomputer has sufficient memory and speed.


item information 
function 



Not recommended for item analysis programs serving the above mentioned
purposes.



h 



Note: See the text for definitions, formulas, and explanations related to formulas [48] - [56]. 
Project (NIE-G-80-0119) as these were applied to the South Carolina Basic
Skills Assessment Program (SCBSAP). A basic description of the SCBSAP is
given in Huynh and Castell (1982).

The data base used in our study consisted of responses from 2400
students in each of several grades who had taken the Mathematics and
Reading tests of the SCBSAP in 1981. This large group was selected as a
stratified cluster sample of the South Carolina student population. The
Reading test contained 36 items and the Mathematics test contained 30 items.
Within each grade level four items were selected for study. In the population
of 2400 students the items selected had p-values between approximately 0.85
and 0.55, the range of p-values we believe is likely to be encountered in
teacher-made domain-referenced tests.

To simulate fluctuations from sample to sample, 80 random samples of
30 students each were selected and the various item statistics were computed
for each sample. The samples were selected in such a way that some (if not all)
of the 30 students within a sample were from the same classroom. We note
that the class-to-class or year-to-year fluctuations experienced by a teacher
are likely to be less variable than fluctuations based on simple random
sampling, since a teacher will generally use a test either within the same
school building (usually associated with a neighborhood) or in different
buildings but within the same school district. Simple random samples from
a state's population should be more variable, since any one sample would
contain students from widely scattered school districts with quite diverse
characteristics.
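
The general idea of the sampling study can be sketched as follows. This is a simplified illustration, not the procedure actually used: it draws simple random samples from a synthetic population of 2400 right/wrong responses rather than sampling within classrooms from the SCBSAP data bank, and the function name, the default population p-value, and the seed are assumptions.

```python
import random
import statistics


def simulate_p_values(phi=0.56, population_size=2400, n_samples=80,
                      sample_size=30, seed=1):
    """Draw repeated class-sized samples and return the sample p-value of each."""
    rng = random.Random(seed)
    n_correct = round(phi * population_size)
    # Synthetic population: 1 = item answered correctly, 0 = incorrectly.
    population = [1] * n_correct + [0] * (population_size - n_correct)
    return [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(n_samples)
    ]


p_values = simulate_p_values()
print("mean of sample p-values:", round(statistics.mean(p_values), 3))
print("std. dev. of sample p-values:", round(statistics.stdev(p_values), 3))
```

Replacing the synthetic population with real response records, and the simple random draw with a within-classroom draw, would bring the sketch closer to the design described above.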

It is likely, however, that the sampling distributions we report are more
variable than a teacher might experience, lying somewhere between a distribution
of strictly random samples and a distribution of within-classroom samples over
years. This is because, although we sampled students within a classroom,
students in the subsequent sample came from another school district.

In this paper we report only the preliminary results, since the study
is on-going. We report sampling fluctuations for the following statistics:
item difficulty, item discrimination, proportion in each third of the class,
modified caution index for items, and chi-square. Each statistic is
computed for each of four items as follows:



          Reading                      Mathematics
   Grade 1       Grade 6        Grade 2       Grade 6

   Item 18       Item 34        Item 4        Item 21

   φ = 0.597     φ = 0.560      φ = 0.564     φ = 0.559

   b = 0.959     b = 0.785      b = 2.288     b = -0.011

Here, φ is the proportion of the 2400 students answering the item correctly
and b is the Rasch item difficulty for these items. Because this is a
preliminary report of our empirical study, we have not reported data on the
other items investigated.

Table 26 shows the empirical sampling distribution of the item difficulty
index, p_i, for each of the four items. The four distributions are roughly
comparable. Sample p-values range from approximately .84 to .30. The mean
of each distribution is reasonably close to its expected value, φ. However,
the distributions are slightly more variable than expected. The standard
error of a proportion based on random samples is

\sigma_p = \sqrt{\phi (1 - \phi) / N}

where φ is the population proportion and N is the sample size. For each of
the distributions in Table 26, σ_p is approximately 0.09, whereas the actual
standard deviations are around 0.10.
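
The 0.09 figure can be checked with a few lines of arithmetic. The sketch below assumes the usual binomial standard error of a proportion, the square root of φ(1 − φ)/N, with N = 30 and the population p-values listed for the four items in Table 26; the function and variable names are illustrative.

```python
import math


def se_proportion(phi, n):
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(phi * (1.0 - phi) / n)


# Population p-values for the four items, as listed in Table 26.
population_p = {"Item 18": 0.597, "Item 34": 0.560, "Item 4": 0.564, "Item 21": 0.559}
standard_errors = {item: se_proportion(phi, 30) for item, phi in population_p.items()}

for item, se in standard_errors.items():
    print(item, round(se, 3))
```

Each value rounds to 0.09, slightly below the empirical standard deviations of .09 to .11 shown in Table 26.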



Insert Table 26 here 
Table 26. Empirical sampling distributions for the item difficulty index p

                        Reading                  Mathematics
                   Grade 1     Grade 6       Grade 2     Grade 6
                   Item 18     Item 34       Item 4      Item 21
                   φ = 0.597   φ = 0.560     φ = 0.564   φ = 0.559
Values of p        b = 0.959   b = 0.785     b = 2.288   b = -0.011

 .95 - 1.00
 .90 - .94
 .85 - .89
 .80 - .84             1           1                         1
 .75 - .79             4           1             1           3
 .70 - .74             3           7             4           6
 .65 - .69             8           6             5           4
 .60 - .64            23          20            18          19
 .55 - .59             9           9            10           8
 .50 - .54            15          15            17          14
 .45 - .49            10           7             9          11
 .40 - .44             1           7             9          10
 .35 - .39             1           3                         2
 .30 - .34                         4             1           2
 .25 - .29                                       1
 .20 - .24
 .15 - .19
 .10 - .14
 .05 - .09
 .00 - .04

Mean                 .59         .55           .56         .55
Std. Dev.            .09         .11           .10         .11
No. of samples        80          80            80          80
Table 27 summarizes the empirical distributions of several item
discrimination indices. The distributions behave as expected. Note that the net
D index was computed on the basis of upper and lower thirds and upper and
lower halves. As expected, the items show less discrimination when the halves
are used compared to the thirds: the mean discrimination index for the
halves distributions runs approximately 0.10 to 0.14 lower than the means of
the thirds distributions. Since on the average the persons in the halves
groups are closer in ability to each other than are the average persons in
the thirds groups, this result is expected. Further, since there are more
students in the halves groups than in the thirds groups (15 vs. 10 students),
the sampling distribution of net D when computed on halves is less variable.

With a lower mean discrimination value and less variability, more poorly
discriminating items would be identified if the upper and lower groups
consisted of the halves of the class rather than the upper and lower thirds.
For example, if D_i < 0.30 is used as a rule of thumb for flagging a poorly
discriminating item, then in 80 replications, Item 18 would be flagged 1
time using the thirds procedure vs. 8 times with the halves procedure,
Item 34 one time vs. 10 times, Item 4 five times vs. 26 times, and Item 21
twelve times vs. 26 times. We would take a conservative view, stating that it
is better to flag an item and have a teacher check it than to let the item
go by unreviewed. Thus, we would recommend using the upper and lower halves
for the net D index.
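
The difference between the thirds and halves versions of net D can be seen in a small sketch. It assumes net D is the proportion answering correctly in the upper group minus the proportion in the lower group, with groups formed by ranking pupils on total test score; the function name, the `parts` argument, the handling of tied totals (left to sort order), and the example data are all hypothetical.

```python
def net_d(item_scores, total_scores, parts):
    """Net D for one item: p(upper group) - p(lower group).

    parts = 3 uses the upper and lower thirds; parts = 2 uses the halves.
    """
    n = len(total_scores)
    k = n // parts  # group size, e.g. 10 or 15 pupils in a class of 30
    order = sorted(range(n), key=lambda pupil: total_scores[pupil])
    lower, upper = order[:k], order[n - k:]
    p_upper = sum(item_scores[pupil] for pupil in upper) / k
    p_lower = sum(item_scores[pupil] for pupil in lower) / k
    return p_upper - p_lower


# Hypothetical class of 30 pupils with distinct total scores 0..29;
# the item is answered correctly by every pupil scoring 12 or more.
totals = list(range(30))
item = [1 if t >= 12 else 0 for t in totals]

d_thirds = net_d(item, totals, parts=3)
d_halves = net_d(item, totals, parts=2)
print(d_thirds, d_halves)
print("flag with halves rule:", d_halves < 0.30)
```

In this example the halves version is lower than the thirds version because the lower half includes some pupils who answered correctly, which mirrors the direction of the difference reported in Table 27.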

Insert Table 27 here 

Table 28 shows distribution of the proportions of students passing an 
item in the upper, middle, and lower thirds of the class based on the total 
Table 27. Empirical sampling distributions for item discrimination indices (D 1/3 = net D computed using
upper and lower thirds of the class, D 1/2 = net D computed using the upper and lower halves,
BIS = biserial correlation, P-BIS = point biserial correlation).
test score. The sampling distributions are as expected: lower-third students
answer the item correctly in fewer numbers than the middle- and upper-third
students, and the variability is as indicated by sampling theory. An exception
to this statement is the middle third of the students on Item 34. This group
seems to be more variable than expected. It appears that some useful
information for teachers can be obtained by displaying these proportions for
each item in each class.



Insert Table 28 here

Table 29 shows the sampling distributions of the modified caution index
for items. This index is designed to identify items exhibiting unusual
responses compared to the other items in the test. Since the four items in
Table 28 are part of a large scale testing program in which the items were
professionally reviewed, tried out, and selected, we would not expect high
values of this caution index in Table 29. This appears to be upheld.
Virtually all of the values of the caution index are below 0.55. Thus, none 
of these items would likely have been brought to a teacher's attention as 
unusual in their performance relative to other items in the test. We recommend 
that this index be incorporated into the instructional improvement and guidance 
section of an item analysis program if a microcomputer can handle it. 

Insert Table 29 here 

Table 30 shows the distributions of the frequency chi-squares, SSQ_i, which
test whether the upper scoring group follows a guessing pattern (i.e., a
uniform distribution). We expect that with SCBSAP items, upper group students
would not guess. Thus, SSQ_i values should be large and the hypothesis of
Table 28. Empirical sampling distributions of the proportion of each third of a sample answering an
item correctly.

Table 29. Empirical sampling distributions for the modified caution index
for items.
a uniform distribution would be rejected. Table 30 shows that the rate of
retention of the hypothesis of a uniform distribution is quite small.
(Items from grades 1 and 2 have 3 alternatives and items from grade 6 have
4 alternatives. Thus, the degrees of freedom are 3 and 4, respectively.)
Thus, from these preliminary data our original fear of a large Type II error
rate is not upheld.
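
A check of this kind can be sketched as a chi-square goodness-of-fit test against a uniform distribution of responses across alternatives. The sketch is illustrative only: it uses the ordinary Pearson statistic, whereas the exact definition of SSQ_i is the one given earlier in this report; the degrees of freedom are passed in by the caller rather than derived, the response counts are hypothetical, and the critical values are the standard .05-level chi-square values. With upper groups of only 10 to 15 pupils the chi-square approximation is rough.

```python
# Standard upper .05 critical values of the chi-square distribution.
CRITICAL_05 = {2: 5.991, 3: 7.815, 4: 9.488}


def uniform_chi_square(counts):
    """Pearson chi-square statistic against equal expected counts per category."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def looks_like_guessing(counts, df):
    """True if the uniform (guessing) hypothesis is retained at the .05 level."""
    return uniform_chi_square(counts) < CRITICAL_05[df]


# Hypothetical upper group of 10 pupils on a 4-alternative item:
# 9 chose the keyed answer, 1 chose a distractor.
counts = [9, 1, 0, 0]
statistic = uniform_chi_square(counts)
print(statistic, looks_like_guessing(counts, df=3))

# A spread-out pattern, as might arise from guessing.
print(looks_like_guessing([3, 2, 3, 2], df=3))
```

An item for which the guessing hypothesis is retained in the upper group would be a candidate for flagging, since even the highest scoring pupils appear to be responding at random.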



Insert Table 30 here 
SUMMARY 

We have reviewed fifty or so statistics in this report in relation to
their usefulness for an item analysis microcomputer program that is intended
to be appropriate for the analysis of domain-referenced classroom tests. We
took the view that the primary purposes of an item analysis of classroom tests
are to: (a) inform the teacher about the strengths and weaknesses of the
class in relation to the skills measured by the individual test items and
(b) inform the teacher about the items that do not seem to be functioning
well so that the teacher can rewrite or otherwise revise these items. A
secondary purpose of a classroom item analysis program is to select items from
a pool of items (an item bank) to put on a particular test in order to improve
the utility of that test for a particular purpose.

In order to provide a context in which to review item statistics, we
defined three broad areas of information a teacher would need in relation
to test items. Then we specified the particular information needs which
item-based information can serve under each of these three broad areas and how
these particular kinds of information can link together testing and instruction.
Table 30. Empirical sampling distribution of the chi-square statistic SSQ_i
for testing whether students in the upper group responded
randomly to the items.
Next we considered, in relation to each specific type of information,
several statistical indices which seemed to provide the information required.
We reviewed each statistic in terms of its statistical and numerical
properties, its suitability for the type of data likely to be encountered with
classroom tests, its ability to be understood by teachers, and the practicality
of computing it on a microcomputer of the type typically found in schools.
As a result of this analysis, we prepared, for each specific type of information,
our recommendations in relation to each statistic. For each type of
information we classified the statistics reviewed as (a) basic (to be included in
every item analysis program if at all possible), (b) recommended (useful
statistics that should be included if the microcomputer has sufficient memory
and speed and if research shows that teachers can use them), and (c) not
recommended (for item analysis microcomputer programs that are intended to
serve the purposes we outlined).

In addition to this literature review, we reported some preliminary
results of an empirical sampling study we are in the process of undertaking
to study the sampling fluctuations of some of the recommended item analysis
statistics. The preliminary results of this empirical study indicated that
the recommendations we made were generally upheld by data from classrooms.
Further, the empirical results offered guidelines for setting rules of thumb
for the numerical value of statistics to use when flagging an item and
bringing it to the attention of the teacher.